For the ITCS 8190 course project, I used Airbnb property listing and review data to recommend properties to users based on the terms included in their queries. Each property's listing page includes a description, written by the property owner to attract guests. Many of these descriptions contain the keyword terms that users typically include in their queries, which increases the likelihood of the property being noticed. After a stay, a guest can leave a free-text review describing their overall experience, which helps other Airbnb users decide whether to book the same property.
The objective of this project was to use each property's listing description and review comments to determine which properties are "most similar" to what the user is looking for in their query. This can be accomplished using the vector space representation, in which a document and a query are transformed into two vectors and the cosine similarity between them is calculated; for non-negative weights, this score is bounded between [0, 1]. A higher similarity score indicates that the document and query are closely related, whereas a lower score indicates the opposite. The similarity score is built from two components, log-frequency weighting and inverse document frequency, which are calculated for each term based on the number of occurrences in the document and the number of documents in the collection that contain that term. To avoid reprocessing the listing descriptions and review comments each time a query is submitted, an inverted index is used: a dictionary of distinct terms, where each term has an associated postings list containing the ids of the documents in which the term appears. In addition to the objectives above, the project was required to apply distributed computing via Hadoop or Apache Spark.
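For reference, a standard formulation of these two components (the textbook log-tf and idf definitions, which I assume here; $N$ is the number of documents in the collection and $\mathrm{df}_t$ is the number of documents containing term $t$) is:

$$w_{t,d} = (1 + \log_{10} \mathrm{tf}_{t,d}) \times \log_{10}\!\left(\frac{N}{\mathrm{df}_t}\right)$$

and the cosine similarity between a query $q$ and a document $d$ is the dot product of their length-normalized weight vectors:

$$\mathrm{sim}(q, d) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$$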
Note that some properties are more popular than others, so the number of reviews per property varies. Since the objective is to recommend a set of property listings, I calculated the similarity score for every review associated with each property and averaged these scores to produce a single review-based similarity score for the property. I then merged the similarity scores obtained from the listing and review data into an aggregated score for the property, which can be used for ranking the properties and suggesting a set of them to the user. Figure 1 displays the workflow that was created as a result of this project.
Figure 1: Workflow of proposed system
Airbnb is a popular platform where guests can find places to stay that are often cheaper than hotels. The motivation for this project is the demand from travelers who want to find accommodations via Airbnb that best suit their needs. I once had a personal experience with Airbnb that made me not want to use the service again because of the property's condition when I arrived, so a tool that processes all listing descriptions and review comments to produce a subset of recommended properties could increase the likelihood that guests have a positive experience. The Airbnb website does not appear to allow users to search for properties using free text; instead, searches use check-in/check-out dates, a desired destination, and the number of guests (as shown in Figure 2). Supporting free-text search would benefit both Airbnb and its users: user retention would likely increase for the former (ultimately leading to more revenue), and positive experiences would become more frequent for the latter.
Figure 2: Airbnb homepage with query fields [3]
In addition, I wanted to try something different from what I am used to. I have taken classes and completed projects that apply machine learning, optimization, and other algorithms to numerical data, which is typically provided in a structured format. I do not have much experience working with text data, so I figured this would be a good opportunity to practice with it and to develop a system that processes it in a distributed fashion (also a first for me).
I used the listing and review data provided by Inside Airbnb [1] for this project. Inside Airbnb provides this data for a number of cities around the world and appears to update it regularly. I focused on Asheville, NC because its data is relatively small compared to some other cities (e.g., New York City, whose review data set had over 1 million records), and because I have never been there, so perhaps I can use this system to plan a trip and find somewhere to stay. The Asheville, NC review data set contains 168,272 reviews associated with 2,099 property listings. Both the listing and review data sets are provided as .csv files. In the listing data set, each row corresponds to a single property and contains a 'description' field with the text written by the property owner. In the review data set, each row corresponds to a single review of a given property and contains a 'comments' field with the text written by the guest. These are the fields that were used for creating the inverted indexes and calculating the similarity scores. The listing ids appear in both data sets, so it was possible to map each review to its listing.
Prior to creating the inverted indexes, I had to exclude reviews written in non-English languages, as well as reviews that were too short (i.e., less than a paragraph long) or had no comments at all. According to [2], a sentence typically uses 40-50 characters. Given that a paragraph contains 3-5 sentences, I excluded reviews that were less than 120 characters in length or contained fewer than three punctuation characters (indicating that the review was less than a paragraph long). For the latter condition, I accounted for the case where ellipses (or semi-ellipses) appeared in the comments and excluded them from the punctuation count. I wrote a Python script that uses the langdetect library to check the language of each review and to verify that it satisfies the length conditions. Surprisingly, some property listings had no description at all, so I added a step that populates the field with "NA" so the review-to-listing mapping still holds. Note that this preprocessing was performed on my own machine, since the dsba-hadoop cluster does not have the langdetect library installed. After filtering, the data set contains 129,165 reviews associated with 1,951 properties. Although both data sets were readily available and easy to retrieve, there were some inconsistencies in the descriptions and comments that caused issues when creating the inverted indexes: some descriptions in the listing data contained what appears to be HTML code, and some entries in the review data used different character encodings (see Figure 3). This is discussed later in the report.
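A minimal sketch of this filtering step is shown below; the thresholds mirror the rules described above, but the helper names are my own and my actual script differs in its details:

```python
import re
import string

from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect deterministic across runs

MIN_CHARS = 120  # ~3 sentences at 40-50 characters each
MIN_PUNCT = 3    # rough proxy for "at least a paragraph"

def punctuation_count(text):
    """Count punctuation characters, first stripping ellipses and
    semi-ellipses ('..', '...', etc.) so they are not counted."""
    without_ellipses = re.sub(r"\.{2,}", "", text)
    return sum(1 for ch in without_ellipses if ch in string.punctuation)

def keep_review(comment):
    """True if the review is non-empty, long enough, and in English."""
    if not comment or len(comment) < MIN_CHARS:
        return False
    if punctuation_count(comment) < MIN_PUNCT:
        return False
    try:
        return detect(comment) == "en"
    except LangDetectException:  # e.g., strings with no detectable features
        return False
```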
Unfortunately, I was unable to process all of these entries in a reasonable amount of time using the proposed system, so I instead considered a data set of 10,000 reviews associated with 223 properties, which took approximately 80 minutes to process from beginning to end. It would have been nice to use all of the data for this project, so future work could include optimizing the performance of the proposed system. Regardless, I learned a lot from working on this project and am proud that I was able to implement something that forced me to step out of my comfort zone.
Figure 3: Inconsistencies in listing and review data
As stated before, the vector space representation approach was used to recommend properties based on the terms in the user's query. This required creating separate inverted indexes for the listing descriptions and the review comments; each contains the distinct terms that appear in the text, along with a postings list indicating the listings/reviews in which each term appears and its number of occurrences. Using the inverted index, all parameters required to convert each listing/review into a vector and calculate its similarity score with respect to the query can be determined. Both tasks were coded in Python and executed on the dsba-hadoop cluster using Spark, which performs the various transformations on the data in parallel.
I wrote two user-defined functions (UDFs) for the inverted index task: one that processes each character in a word (i.e., the linguistic module) and one that represents the postings list of each distinct term as a Python dictionary. The input to the linguistic module is a key-value pair (documentid, word), and the output is the same pair with all punctuation removed from the word and all letters lowercased. After the pairs are processed through the linguistic module, they are re-keyed by word and a reduceByKey transformation is applied, producing (word, documentids) pairs; these are then processed by the second UDF, which stores the documentids in a Python dictionary. A final operation sorts the terms in the inverted index and the documentids in each postings list in ascending order. Suppose that a given word w appears in n different documents, and let tf_wi denote the term frequency of w in document i. Then an entry in the inverted index can be expressed as follows:
w {document_1: tf_w1, document_2: tf_w2, ..., document_n: tf_wn}
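A condensed PySpark sketch of this pipeline is shown below; the variable and function names are my own for illustration, and the actual UDFs differ in their details:

```python
import string

from pyspark import SparkContext

sc = SparkContext(appName="inverted-index")

def linguistic_module(word):
    """Lower-case a token and strip all punctuation."""
    return word.lower().translate(str.maketrans("", "", string.punctuation))

# docs: RDD of (document_id, text) pairs loaded from the CSV
docs = sc.parallelize([(1, "Cozy house near hiking trails!"),
                       (2, "Pet-friendly house, clean and safe.")])

inverted_index = (docs
    .flatMap(lambda kv: [((linguistic_module(w), kv[0]), 1)
                         for w in kv[1].split()])
    .filter(lambda kv: kv[0][0] != "")          # drop tokens that were all punctuation
    .reduceByKey(lambda a, b: a + b)            # tf of each (term, document) pair
    .map(lambda kv: (kv[0][0], {kv[0][1]: kv[1]}))
    .reduceByKey(lambda d1, d2: {**d1, **d2})   # merge into one postings dict per term
    .sortByKey())                               # sort terms in ascending order

# e.g. ('house', {1: 1, 2: 1}), matching the entry format shown above
```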
An example inverted index after executing this task is displayed in Figure 4.
Figure 4: Example of an inverted index
I wrote two UDFs for the vector space representation task: one that recovers the distinct terms appearing in each listing description/review comment, and one that calculates the cosine similarity scores. For the listing data, the input to the first UDF is a single key (listingid), which is used to calculate the tf-idf weight of each term in the listing description along with the description's vector length; the output is a tuple (listingid, tf-idf weights, length). The parallelize transformation converts the list of all listingids into a resilient distributed dataset (RDD), allowing these computations to be performed in parallel. For the review data, the input is a key-value pair (reviewid, listingid) and the output is again a tuple (listingid, tf-idf weights, length); this representation is necessary to map each review to its listing. As with the listing data, the parallelize transformation converts the list of all (reviewid, listingid) pairs into an RDD. Before the second UDF runs, the user's query is split into its component terms and all characters are lowercased. The listing/review ids, together with their tf-idf weights and lengths, are then used as input to the cosine similarity task, whose output is a key-value pair (listingid, similarity-score).
To calculate the log-frequency and inverse document frequency weights for the user's query, each query term is checked to see whether it appears in at least one document (listing/review) in the collection. If a query term t does not appear in any document, its tf-idf weight is set to zero. Otherwise, the log-frequency component is calculated from the frequency of t in the query, and the inverse document frequency component is calculated from the total number of documents that contain t. Once the tf-idf weights are calculated for the query and the document, both sets of weights are normalized by their respective vector lengths and multiplied together to obtain the similarity score.
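A condensed Python sketch of this scoring step is shown below, with illustrative names; I assume base-10 logarithms and the standard tf-idf definitions, consistent with the formulas given earlier:

```python
import math

def tfidf_weights(term_counts, inverted_index, n_docs):
    """tf-idf weights for one document or query, given its term counts.
    Terms absent from the inverted index receive a weight of zero."""
    weights = {}
    for term, tf in term_counts.items():
        df = len(inverted_index.get(term, {}))  # documents containing the term
        if df == 0:
            weights[term] = 0.0  # query term appears in no document
        else:
            weights[term] = (1 + math.log10(tf)) * math.log10(n_docs / df)
    return weights

def cosine_similarity(q_weights, d_weights):
    """Dot product of the two length-normalized tf-idf vectors."""
    q_len = math.sqrt(sum(w * w for w in q_weights.values())) or 1.0
    d_len = math.sqrt(sum(w * w for w in d_weights.values())) or 1.0
    dot = sum(w * d_weights.get(t, 0.0) for t, w in q_weights.items())
    return dot / (q_len * d_len)

# For the review data, the per-review scores are then averaged per listing
# (the step described next), e.g. with reduceByKey:
# scores: RDD of (listing_id, similarity_score), one element per review
# avg = (scores.mapValues(lambda s: (s, 1))
#              .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
#              .mapValues(lambda t: t[0] / t[1]))
```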
It was necessary to compute an average similarity score for the review data, since multiple reviews can be associated with a single listing. Once the per-review similarity scores are computed, a reduceByKey transformation is applied to obtain the average score for each listing from the RDD. After the similarity scores are calculated from both the listing and review data, they must be merged into an aggregated score for each listing. Since the orders of magnitude of the two sets of scores might differ drastically, I normalized all similarity scores (with respect to the data set they were computed from) to the range [0, 1], so that the listings with the lowest and highest similarity scores receive values of 0 and 1, respectively. Let L denote the index set of all listing ids, and let s_ij denote the similarity score of listing i computed from data set j. The expression used for normalizing the similarity score of listing i with respect to data set j is shown below:
$$\bar{s}_{ij} = \frac{s_{ij} - \min_{k \in L} s_{kj}}{\max_{k \in L} s_{kj} - \min_{k \in L} s_{kj}} \qquad (1)$$
After normalizing the similarity scores for both data sets, the weighted sum of the two scores for listing i is calculated to provide an aggregated score that considers both the listing and review data. Note that the aggregated score can no longer be interpreted as a cosine similarity, but rather as a dimensionless score that can still be used to rank the listings. The merge is performed by a Python script that runs at the command line and does not rely on Spark. Let α denote the weight given to the review-data similarity score, bounded between [0, 1]. The aggregated score for listing i can then be expressed as follows:
$$a_i = \alpha\,\bar{s}_{i,\mathrm{review}} + (1 - \alpha)\,\bar{s}_{i,\mathrm{listing}} \qquad (2)$$
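A minimal sketch of this merge script is shown below (plain Python, as described above; the function names and the guard for all-equal scores are my own additions):

```python
def min_max_normalize(scores):
    """Equation (1): rescale raw scores to [0, 1] within one data set."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all scores being equal
    return {i: (s - lo) / span for i, s in scores.items()}

def aggregate(listing_scores, review_scores, alpha=0.7):
    """Equation (2): weighted sum of the two normalized scores."""
    ls = min_max_normalize(listing_scores)
    rs = min_max_normalize(review_scores)
    return {i: alpha * rs.get(i, 0.0) + (1 - alpha) * ls.get(i, 0.0)
            for i in set(ls) | set(rs)}

# Rank listings by aggregated score, highest first:
# ranked = sorted(aggregate(listing_scores, review_scores).items(),
#                 key=lambda kv: kv[1], reverse=True)
```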
To demonstrate the functionality of the proposed system, let us walk through a user's query to see how properties are recommended based on the terms used. Suppose that 3,000 reviews, associated with 107 properties, are used for the analysis. After building the inverted indexes from the listing and review data (which contain 2,621 and 8,270 entries, respectively), the vector space representation task can begin. Assume the user typed the following query: "house brewery pets clean hiking safe". This user is likely looking for an Airbnb property that is a house, has local breweries nearby, allows pets, is well maintained and clean, has nearby hiking trails, and is located in a safe area. Performing the vector space representation task on the listing data returns the top 20 recommended properties based on the cosine similarity metric (as shown in Figure 5).
Figure 5: Top 20 properties based on the similarity score using the listing data
Figure 6: Example output that is printed to the screen for the user
Property 1198090 had the highest similarity score, followed by properties 674894 and 246315. The output shown in Figure 5 is not what the user sees once the task finishes; instead, the properties are listed in the same order, but the similarity score is replaced with a link that takes the user to the property page on the Airbnb website (an example is shown in Figure 6). The frequency count of each query term in the top three listings is shown in Figure 7, and the document frequency of each query term in the listing data is displayed in Figure 8. The values in Figures 7 and 8 are what determine the similarity scores between the query and each listing.
Figure 7: Frequency of query terms in the top three listings
Figure 8: Document frequency of query terms using the listing data
Note that the "brewery" term is the least frequent term in the listing collection whereas "house" is the most frequent used term. With that in mind, it makes sense that listing 1198090 has the largest similarity score since the tf-idf equation gives a higher preference to terms that appear less frequently (which would be "brewery" for this case). The next least frequent term is "pets", which appears twice in the description for listing 67894, which by default gives this listing a higher similarity score since it contains this less frequent term, as well as three other query terms (including "house", "clean", and "safe"). Listing 246315 contains "hiking" and "safe", which can also be considered as rare terms, but less rare than "brewery" and "pets". With that being said, it appears that the top three listings are ordered in such a way that is consistent with the rarity of the query terms in the document collection.
Executing the same task using the review data resulted in the top 20 properties shown in Figure 9.
Figure 9: Top 20 properties based on the similarity score using the review data
The review data did not produce the same ordering of properties as the listing data. This was to be expected, since the two inverted indexes were constructed separately and have different compositions, and since the review-based scores are the so-called "average similarity scores" for each property listing. Note also that the similarity scores from the review data are not of the same magnitude as those from the listing data, which is why both sets of scores must be normalized (as discussed in Section 4) so that the two components of the weighted sum are balanced.
The frequency count of each query term in the top three listings from the review data is shown in Figure 10. In addition, the document frequency of each query term in the review data is displayed in Figure 11.
Figure 10: Frequency of query terms in the top three recommended listings using the review data
Figure 11: Document frequency of query terms using the review data
It was interesting to see that listing 3590643 was the top recommended property, since only one query term ("hiking") appeared in its two associated reviews. The term "hiking" is relatively rare in the document collection compared to "house" and "clean", which is perhaps why listing 3590643 ranked first. The output in Figure 10 is not as easy to interpret as the output in Figure 7, since each entry in the former aggregates multiple reviews, whereas each entry in the latter corresponds to a single listing. Looking only at the raw frequency counts in Figure 10, it might seem that listing 2477628 should have been the top recommendation, since the query terms appear much more frequently in its reviews; however, ranking on raw frequency would bias the outcome toward properties with the most reviews. Using the average similarity score appears to mitigate this issue, since two of the top three properties have only two reviews each.
After aggregating the similarity scores with α = 0.7, the properties shown in Figure 12 are suggested.
Figure 12: Top 20 properties based on the aggregated score
Note that the first three properties in Figure 12 are the same as in Figure 9 because a higher weight was assigned to the review similarity scores, but the ordering begins to change after that. In Figure 12, the fourth recommended property is 634319, which ranked 17th and 7th using the listing and review data, respectively. Such a difference can ultimately influence whether the user navigates to the property listing page at all: a user is unlikely to scroll to the 17th recommendation, somewhat more likely to reach the 7th, and very likely to see the 4th. Similar behavior can be observed for some of the remaining properties. The advantage of the aggregated score is that multiple data sources, rather than a single one, are used to recommend properties, which can have a significant impact on the ordering of the suggestions (compare Figures 5 and 9). I am not suggesting that the aggregated score is the best metric for merging the listing and review data, but it offers an alternative way to analyze the data and, as shown in Figure 12, does change the ordering of the recommended properties.
To evaluate the proposed system, I considered multiple problem instances in which the number of listing descriptions and review comments is increased, in order to analyze the impact on computational time. The inverted indexes for the listing and review data had to be recreated for each instance, and since modifying the inverted indexes changes the similarity scores, the vector space representation task also had to be re-run for each instance. The computational time of each task (namely inv-index and vector-space) was recorded in minutes for the listing and review data separately, in order to see where most of the time is consumed when executing the system end-to-end. Since there are more review comments than listing descriptions, I varied the number of reviews when generating the problem instances. In total, nine instances were generated, consisting of 100, 500, 1,000, 2,000, 3,000, 4,000, 5,000, 6,000, and 10,000 reviews. The number of listings for each instance also increased, though not in as orderly a fashion as the review counts. It was not necessary to vary both parameters, since the two pipelines are implemented almost identically, the only exceptions being the mapping of review ids to listing ids and the calculation of the average similarity scores for the review data. Figure 13 displays the output of the computational study.
Figure 13: Output from the computational study for nine problem instances
It can be observed from Figure 13 that creating the inverted index took one minute for all instances, regardless of the data source (listings or reviews). For reference, the listing and review inverted indexes for the largest instance contained 3,910 and 14,092 entries, respectively. The vector-space task also took approximately one minute for all instances when using the listing data. However, the computational time of the vector-space task on the review data increases dramatically as the number of review comments grows, requiring approximately 80 minutes for 10,000 reviews. The one-minute times for the listing data are not surprising, since the same behavior was observed for the review data until the number of review comments reached 2,000, at which point four minutes were required. Figure 14 plots the computational time of the vector-space task on the review data for all instances.
Figure 14: Performance evaluation of the vector-space task with respect to the review data
From the graph, the computational time appears to grow much faster than linearly as the number of review comments increases. I was not able to run instances with more than 10,000 review comments to gain further insight, but it is clear that the vector-space task is the bottleneck of the proposed system, particularly when processing the review data, since the listing data is smaller. I tried my best to optimize the code I developed for this task, but technological difficulties on my local machine prevented me from investigating further (Dr. Akella is aware of my situation). Future work could include further optimizing the implementation of this task to reduce the total computational time.
Before developing this system, Dr. Akella required students to categorize their project deliverables into three groups: "will definitely accomplish", "likely to accomplish", and "would ideally like to accomplish". This step was intended to help students create an organized plan for approaching the project and to hold themselves accountable for completing at least the most critical deliverables specified in the project proposal. The categorization of deliverables that I submitted with my project proposal is displayed below.
Will definitely accomplish
1. Inverted index for review data with term frequency (using Spark)
2. Vector space representation of the review data (using Spark)
3. Similarity scores of each property using the review data (using Spark)
Likely to accomplish
4. Inverted index for listing data with term frequency (using Spark)
5. Vector space representation of the listing data (using Spark)
6. Similarity scores of each property using the listing data (using Spark)
7. Combining the similarity scores from the review and listing data to create an aggregated score (using Python)
Ideally would like to accomplish
8. Apply a clustering/classification algorithm for determining if a given property is "good" or "bad" based on the review comments
9. Recommend properties to the user using these classifications
The original plan was to use only the review data for recommending properties, since that data set contained more data. I was able to accomplish this (items 1-3), which allowed me to also incorporate the listing data and generate aggregated scores for each property for use in the recommendations (items 4-7). Unfortunately, I was not able to complete items 8-9, but I am okay with that, since I created an end-to-end system that satisfies the specified project objectives. It would have been nice to optimize the system further so that larger problem instances could be handled in reasonable time, but I hope to revisit this work at a later date to address that issue.
The aspects of the system I would address next are (a) the inverted index, (b) the linguistic module, and (c) the vector space representation. For (a), it would likely be advantageous to reduce the number of entries to make the computation in (c) more efficient. This could be accomplished by removing terms deemed "not useful" by subject matter experts at Airbnb, and by applying a stemming algorithm to merge nearly identical terms into a single index entry (see the sketch below). For (b), I had difficulty handling characters that are encoded differently in the review comments, as well as the inconsistencies that appear in the listing descriptions (as displayed in Figure 3); I imagine a linguistic module requires extensive effort and continuous refinement, so developing one is not a trivial task. For (c), I would need to further investigate the functionality of Spark and run additional experiments to fine-tune its parameters and better utilize the available cluster resources; however, the most helpful improvement for (c) would likely be implementing the algorithm more efficiently. Although there is room for improvement in the system I developed, I learned a lot about implementing a distributed computing system and gained experience working with text data. I hope to continue improving my skills in this domain, since many organizations could benefit from offering this type of system to their users, maximizing customer satisfaction and growing the organization's knowledge base so it can better satisfy consumer demand.
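As an illustration of the vocabulary reduction described in (a), the sketch below uses a Porter stemmer from the nltk library together with a stop-word filter; this is one of several possible approaches, not something the current system does, and it assumes nltk's 'stopwords' corpus has been downloaded:

```python
# Hypothetical illustration of (a): stemming plus a stop-word filter to
# shrink the inverted index. Requires: nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def normalize_term(term):
    """Return a stemmed term, or None for stop words ('not useful' terms)."""
    if term in stop_words:
        return None
    return stemmer.stem(term)

# 'hike', 'hiking', and 'hiked' all map to 'hike', merging three
# inverted-index entries into one:
print([normalize_term(t) for t in ["hike", "hiking", "hiked", "the"]])
# -> ['hike', 'hike', 'hike', None]
```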