Today, many of the world's most heavily trafficked websites, such as Netflix, LinkedIn, Amazon, and Twitter, employ recommender systems to engage their users with relevant, personalized content. We have implemented an item-based recommendation approach in an efficient manner using Apache Spark.
In a typical collaborative filtering scenario, we have a list of n products P = {p1, p2, ..., pn} and a list of k users U = {u1, u2, ..., uk}. Let R be an n × k matrix in which each entry R_p,u is the rating given by user u to product p, with its value being a real number or missing. In item-based collaborative filtering, the algorithm infers a user's preferences by computing the most similar items to each item the user has interacted with.
Once we have the rating matrix R, we need to compute the similarities between the items each user has interacted with. Similarity between items can be calculated in many different ways; we use the popular cosine similarity measure.
The cosine similarity between two products px and py is

sim(px, py) = ( Σ_{u ∈ Uxy} r_u,px · r_u,py ) / ( √( Σ_{u ∈ Uxy} r_u,px² ) · √( Σ_{u ∈ Uxy} r_u,py² ) )

where Uxy is the subset of users who rated both px and py, r_u,px is user u's rating of product px, and r_u,py is user u's rating of product py. Since ratings are non-negative, the cosine similarity is bounded by [0, 1]. Once the similarities are calculated, we need to compute the top N recommendations for every user. This is done by iterating through each user's item interaction history and computing a weighted-sum score over each item's neighbor items: the weighted-sum approach takes the average of the ratings of the active product's neighbors, weighting each of them by the neighbor product's similarity to the active product.
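The similarity computation can be sketched in plain Python as follows. The user and rating values are made up for illustration, and each product's ratings are represented as a dict mapping user ID to rating; our actual implementation performs the same computation in parallel with Apache Spark.

```python
from math import sqrt

def cosine_similarity(ratings_x, ratings_y):
    """Cosine similarity between two products, each given as a dict
    mapping userID -> rating."""
    # Only users who rated both products contribute to the sums.
    common = set(ratings_x) & set(ratings_y)
    if not common:
        return 0.0
    dot = sum(ratings_x[u] * ratings_y[u] for u in common)
    norm_x = sqrt(sum(ratings_x[u] ** 2 for u in common))
    norm_y = sqrt(sum(ratings_y[u] ** 2 for u in common))
    return dot / (norm_x * norm_y)

# Two hypothetical products rated by overlapping sets of users
px = {"u1": 5.0, "u2": 3.0, "u3": 4.0}
py = {"u1": 4.0, "u2": 3.0}
print(round(cosine_similarity(px, py), 4))  # 0.9947
```

Because ratings are non-negative, the result stays in [0, 1], matching the bound stated above.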
The predicted rating of product px for user u is

P_u,px = ( Σ_{py ∈ N(px)} sim(px, py) · r_u,py ) / ( Σ_{py ∈ N(px)} |sim(px, py)| )

where N(px) is the subset of neighbor products py ≠ px that user u has rated. The weighted-sum approach thus averages the user's ratings of the active product's neighbors, weighting each of them according to the neighbor product's similarity with the active product.
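A minimal sketch of the weighted-sum prediction in plain Python, using hypothetical product IDs; in practice the similarities come from the cosine computation over the whole rating matrix.

```python
def predict_rating(user_ratings, similarities):
    """Weighted-sum predicted rating of the active product for one user.

    user_ratings: dict productID -> the user's rating of each neighbor product
    similarities: dict productID -> similarity of each neighbor to the active product
    """
    num = sum(similarities[p] * user_ratings[p] for p in user_ratings if p in similarities)
    den = sum(abs(similarities[p]) for p in user_ratings if p in similarities)
    return num / den if den else 0.0

# Hypothetical neighbors of the active product and the user's ratings of them
sims = {"B001": 0.9, "B002": 0.4}
rated = {"B001": 5.0, "B002": 3.0}
print(round(predict_rating(rated, sims), 2))  # (0.9*5 + 0.4*3) / (0.9 + 0.4) ≈ 4.38
```

Normalizing by the sum of similarities keeps the prediction on the same scale as the original ratings.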
Finally, we compute the top n product recommendations for a given user by finding the n products with the highest predicted ratings. Since the predicted rating measures how relevant a particular product is to the active user, we simply pick the n highest-scored products from the weighted-sum calculation.
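The final selection step amounts to a top-n over the predicted scores, sketched here with made-up product IDs:

```python
import heapq

def top_n(predictions, n):
    """predictions: dict productID -> predicted rating.
    Returns the n highest-scored (rating, productID) pairs, best first."""
    return heapq.nlargest(n, ((r, p) for p, r in predictions.items()))

preds = {"B001": 4.4, "B002": 3.1, "B003": 4.9, "B004": 2.0}
print(top_n(preds, 2))  # [(4.9, 'B003'), (4.4, 'B001')]
```

Using a heap-based selection avoids fully sorting every user's candidate list when only the top n items are needed.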
We have used the Amazon product review dataset available online. The review data is in JSON format and contains fields such as the user ID, product ID, and rating.
Sample review data:
{
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewerName": "J. McDonald",
"helpful": [2, 3],
"reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
"overall": 5.0,
"summary": "Heavenly Highway Hymns",
"unixReviewTime": 1252800000,
"reviewTime": "09 13, 2009"
}
where
reviewerID is the user ID,
asin is the product ID, and
overall is the rating.
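Each review line can be reduced to the (userID, productID, rating) triple used to build the rating matrix. A plain-Python sketch (in our Spark implementation, the same extraction would be mapped over the input records):

```python
import json

def parse_review(line):
    """Extract the (userID, productID, rating) triple from one JSON review."""
    review = json.loads(line)
    return review["reviewerID"], review["asin"], review["overall"]

sample = '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "overall": 5.0}'
print(parse_review(sample))  # ('A2SUAM1J3GNN3B', '0000013714', 5.0)
```

The remaining fields (review text, helpfulness votes, timestamps) are not needed for collaborative filtering and are discarded.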
The output gives us a user ID and the list of products recommended to that user, along with their predicted ratings:
UserID, [(rating1, ProductID1), (rating2, ProductID2)]
Example of our recommendation:
For user A2XLYUJOX8V964, our system recommended B005SUTAF8 (Fender Moto Style Guitar Picks), B001QUPFV2 (Audix CABGRAB1 CabGrabber Mic Clamp for Guitar Amps/Cabinets), and B001LNO9I4 (SKB Les Paul Type Guitar Soft Case).
These recommendations are reasonable, as all three products are guitar-related.
For evaluating the accuracy of our recommendations computed by collaborative filtering, we use Mean Absolute Error (MAE), a popular evaluation metric for collaborative filtering algorithms.
MAE is computed by averaging the absolute deviation between the predicted and actual rating over every item in each user's interaction history, across all users.
The lower the MAE, the more accurate the collaborative filtering algorithm is at predicting the item preferences of each user.
To compute the MAE score we divided the data into two parts: 70% of the data was used as a training set and 30% was used for testing the predicted ratings of the products.
MAE for our results was 0.82.
We could try different similarity measures, such as Pearson correlation or Euclidean distance, to check whether the MAE changes.