A news recommendation subscription service
Our news subscription is FREE and personalized (based on user clicks). Users can read the news titles and abstracts, click the news link if interested, and give feedback on the recommendations.
• Potential users: anyone who wants to follow the news, including professionals, international students, and price-sensitive users
Welcome! Please use this link to sign up for this recommendation service: http://eepurl.com/hYLLdz
Please note: after you click the Subscribe button, you will receive news recommendation emails from us. For a better experience, add newsboardnyt@163.com to your email contacts to prevent our emails from being blocked.
The next morning, you will receive an email containing 10 news items, each with a title, an abstract, and a URL. The URL is generated by us and points to a page that shows the news title, the abstract, the original NYT URL, and the full news text in plain text.
At the end of the email, there is a feedback link where you can tell us your preferences so that we can improve your experience.
Below is an example of news sent by email.
We use New York Times news for the recommendations. New York Times API: https://developer.nytimes.com/apis
We crawl news by date, and used each day in February to test our functions.
The news has the following format:
[{"author": ["Paul Krugman"], "section": "Opinion", "abstract": "The push to make Americans\u2019 lives nasty, brutish and short.", "url": "https://www.nytimes.com/2022/01/31/opinion/republican-misinformation-coronavirus.html", "title": "Guns, Germs, Bitcoin and the Antisocial Right", "keywords": ["United States Politics and Government", "Bitcoin (Currency)", "Electric Light and Power", "Gun Control", "Coronavirus (2019-nCoV)", "Republican Party", "Abbott, Gregory W (1957- )", "DeSantis, Ron", "Florida", "Texas"], "publish_data": "2022-02-01", "id": "fe93a16a-33c1-5cee-a16d-0c957186afe7", "text": "In February 2021 a deep freeze caused widespread power outages in Texas, leaving about 10 million Texans without electricity…},{…},…]
Latent Factor models are a state-of-the-art methodology for model-based collaborative filtering. The basic assumption is that there exists an unknown low-dimensional representation of users and items where user-item affinity can be modeled accurately.
In this project, word2vec is used to embed each news title into a vector for similarity calculation. The title embedding uses a simple strategy: take the word2vec vector of every word in the title (ignoring words that are not in the word2vec model's vocabulary) and average them.
The resulting title vectors are used to calculate the similarity between new news titles and previous news titles.
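The averaging strategy and the similarity calculation can be sketched as follows. This is a minimal illustration, not the project's code: `toy_vectors` is a hypothetical stand-in for a trained word2vec model, and the tokenization (lowercasing and splitting on whitespace) is an assumption.

```python
import numpy as np

# Hypothetical stand-in for a trained word2vec model: word -> vector.
toy_vectors = {
    "bitcoin": np.array([1.0, 0.0, 0.0]),
    "price": np.array([0.0, 1.0, 0.0]),
    "texas": np.array([0.0, 0.0, 1.0]),
}

def embed_title(title, vectors, dim=3):
    """Average the vectors of the title words found in the vocabulary."""
    # Assumption: simple lowercase + whitespace tokenization.
    words = [w for w in title.lower().split() if w in vectors]
    if not words:
        # No known word: fall back to a zero vector.
        return np.zeros(dim)
    return np.mean([vectors[w] for w in words], axis=0)

def cosine_similarity(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

new_title = embed_title("Bitcoin price surges", toy_vectors)   # "surges" is ignored
old_title = embed_title("Texas bitcoin miners", toy_vectors)   # "miners" is ignored
similarity = cosine_similarity(new_title, old_title)
```

With a real model, the dictionary lookup would be replaced by the model's own vocabulary check and vector lookup.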
On the first day of recommendation, we need user behavior data (click history) to create a utility matrix (#users * #items), which is factorized into latent factors to obtain the item matrix (k * #items) and the user matrix (#users * k). We used the training set of MIND (https://msnews.github.io/) for this. The training set includes behavior data, used to construct the utility matrix, and news data, used as the base news against which the similarity of each crawled news item is calculated.
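One way to obtain the two matrices is sketched below. It assumes plain gradient descent on the squared error over observed (non-zero) entries only; the function names, the hyperparameters, and the toy matrix are illustrative assumptions and not necessarily what the project code uses.

```python
import numpy as np

def factorize(utility, k=2, steps=2000, lr=0.02, reg=0.01, seed=0):
    """Factorize utility (#users * #items) into user (#users * k) and item (k * #items) matrices."""
    rng = np.random.default_rng(seed)
    n_users, n_items = utility.shape
    user_matrix = rng.normal(scale=0.1, size=(n_users, k))
    item_matrix = rng.normal(scale=0.1, size=(k, n_items))
    observed = utility > 0  # zeros are treated as missing, not as ratings
    for _ in range(steps):
        error = observed * (utility - user_matrix @ item_matrix)
        # Gradient steps with a small L2 penalty (an assumption).
        user_step = error @ item_matrix.T - reg * user_matrix
        item_step = user_matrix.T @ error - reg * item_matrix
        user_matrix += lr * user_step
        item_matrix += lr * item_step
    return user_matrix, item_matrix

def observed_error(utility, user_matrix, item_matrix):
    """Sum of squared errors over observed entries only."""
    observed = utility > 0
    return float(np.sum((observed * (utility - user_matrix @ item_matrix)) ** 2))

# Toy utility matrix: 2 = clicked, 1 = shown but not clicked, 0 = missing.
toy_utility = np.array([[2.0, 1.0, 0.0],
                        [0.0, 2.0, 1.0],
                        [1.0, 0.0, 2.0]])
user_matrix, item_matrix = factorize(toy_utility)
```

MIND is highly sparse, so a practical version would operate on a sparse representation rather than a dense array.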
Utility Matrix
Process the behavior data and create the utility matrix. Fill in the values as follows: 1 if a user received the news but did not click, 2 if a user received the news and clicked. Fill missing values with zero. Only the values 1 and 2 are meaningful; the zeros mark missing entries and are ignored.
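The filling rule can be illustrated with a small example. The `impressions` list is a hypothetical, simplified form of the behavior data (the actual MIND files need parsing first), and keeping the higher value when a user sees the same news twice is an assumption.

```python
import numpy as np

# Hypothetical impression log: (user_id, news_id, clicked).
impressions = [
    ("u1", "n1", True),
    ("u1", "n2", False),
    ("u2", "n2", True),
]

users = sorted({user for user, _, _ in impressions})
items = sorted({news for _, news, _ in impressions})
user_index = {user: i for i, user in enumerate(users)}
item_index = {news: j for j, news in enumerate(items)}

# Zero means "missing": the user never received this news.
utility = np.zeros((len(users), len(items)))
for user, news, clicked in impressions:
    i, j = user_index[user], item_index[news]
    # 2 = received and clicked, 1 = received but not clicked.
    # Assumption: a click wins if the same pair appears more than once.
    utility[i, j] = max(utility[i, j], 2 if clicked else 1)
```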
Item Matrix
For each new news item, calculate its vector (k*1) as the average of the existing item vectors weighted by title similarity, and append it to the original item matrix. The updated item matrix has shape k*(#items + #new items).
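A minimal sketch of this update, under stated assumptions: all base news items contribute to the average, negative similarities are clipped to zero, and a news item with no positive similarity falls back to the plain mean. None of these choices is confirmed by the project description, and `new_item_vector` is a hypothetical helper name.

```python
import numpy as np

def new_item_vector(item_matrix, similarities):
    """Similarity-weighted average of the existing item vectors.

    item_matrix: (k, #items); similarities: (#items,) cosine similarity
    between the new news title and each base news title.
    """
    weights = np.clip(similarities, 0.0, None)  # assumption: ignore negative similarity
    total = weights.sum()
    if total == 0:
        # Assumption: with no similar base news, use the plain mean.
        return item_matrix.mean(axis=1)
    return item_matrix @ weights / total

item_matrix = np.array([[1.0, 3.0],
                        [2.0, 4.0]])          # k = 2 factors, 2 base news items
similarities = np.array([0.75, 0.25])
vector = new_item_vector(item_matrix, similarities)
updated_item_matrix = np.column_stack([item_matrix, vector])  # k * (#items + 1)
```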
User Matrix
Get the list of subscribed users using the Mailchimp API. A newly registered user is assigned the per-factor average across all existing users. The updated user matrix has shape (#users + #new users)*k.
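The cold-start rule for new users can be sketched as below. The email addresses and the `subscribed` list are placeholders; in the project that list would come from the Mailchimp API, which is not called here.

```python
import numpy as np

user_matrix = np.array([[1.0, 2.0],
                        [3.0, 4.0]])                 # #users * k
known_users = ["a@example.com", "b@example.com"]     # one row per known user
# Placeholder for the subscriber list fetched from Mailchimp.
subscribed = ["a@example.com", "b@example.com", "c@example.com"]

new_users = [user for user in subscribed if user not in known_users]
if new_users:
    # Each new user starts at the per-factor average of the existing users.
    average = user_matrix.mean(axis=0)
    new_rows = np.tile(average, (len(new_users), 1))
    user_matrix = np.vstack([user_matrix, new_rows])
    known_users = known_users + new_users
```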
Recommendation
The prediction matrix is calculated by multiplying the updated user matrix by the updated item matrix. For each subscribed user, the 10 news items with the highest scores are recommended. The news is sent using smtplib.
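The scoring and selection step might look like the sketch below. `top_k_items` is a hypothetical helper, the matrices are toy values, and k is set to 2 only to keep the example small; sending the email is left out.

```python
import numpy as np

def top_k_items(user_matrix, item_matrix, k=10):
    """Return the indices of the k best-scoring items per user, plus all scores."""
    scores = user_matrix @ item_matrix       # (#users, #items) prediction matrix
    order = np.argsort(-scores, axis=1)      # item indices, best score first
    return order[:, :k], scores

user_matrix = np.array([[1.0, 0.0],
                        [0.0, 1.0]])
item_matrix = np.array([[0.9, 0.1, 0.5],
                        [0.2, 0.8, 0.4]])
top, scores = top_k_items(user_matrix, item_matrix, k=2)
```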
Refresh Click Data/ Survey
Update the Utility Matrix
Add columns to the utility matrix for the news recommended to subscribed users.
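A possible shape of this update is sketched below. How survey answers map to matrix values is an assumption here: the sketch reuses the 1/2 coding of the original matrix (1 = recommended without positive feedback, 2 = recommended with positive feedback), and the `feedback` tuples are hypothetical.

```python
import numpy as np

utility = np.array([[2.0, 1.0],
                    [0.0, 2.0]])     # existing #users * #items matrix
n_new_items = 2                      # news recommended since the last update

# Hypothetical feedback records: (user row, new item column, positive feedback).
feedback = [(0, 0, True), (1, 1, False)]

# New columns start at zero, meaning the news was not sent to that user.
new_columns = np.zeros((utility.shape[0], n_new_items))
for user_row, new_col, liked in feedback:
    new_columns[user_row, new_col] = 2 if liked else 1

utility = np.hstack([utility, new_columns])
```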
Iterate the process
Matrix factorization -> crawl news -> update base news (= MIND + Day 1 news) -> generate sentence vectors from titles -> calculate cosine similarity -> update item matrix by weighted average -> check for new users -> update user matrix -> matrix multiplication -> for each subscribed user, calculate and sort recommendation scores -> remove news the user has already seen -> take the top 10 -> send email …
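The step that removes already-seen news before taking the top 10 can be sketched for a single user as follows. `recommend_unseen` is a hypothetical helper, and masking seen items with negative infinity is one of several ways to do this.

```python
import numpy as np

def recommend_unseen(scores, seen, k=10):
    """Indices of the k best-scoring items that the user has not seen yet."""
    masked = np.where(seen, -np.inf, scores)   # seen items can never win
    order = np.argsort(-masked)                # best score first, seen items last
    return [int(i) for i in order[:k] if np.isfinite(masked[i])]

scores = np.array([0.9, 0.1, 0.5, 0.7])        # one user's predicted scores
seen = np.array([True, False, False, False])   # item 0 was already recommended
recommended = recommend_unseen(scores, seen, k=2)
```

If fewer than k unseen items remain, the function simply returns a shorter list.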
Candidate news window
Set window = 2: only news from the last 2 days is recommended.
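A small sketch of the window filter, assuming that a window of 2 means the current day and the day before. The `publish_data` key follows the spelling in the crawled news format; `within_window` and the sample records are hypothetical.

```python
from datetime import date, timedelta

def within_window(news_items, today, window=2):
    """Keep news published within the last `window` days, today included."""
    earliest = today - timedelta(days=window - 1)
    return [
        item for item in news_items
        # "publish_data" is the date key used in the crawled records.
        if earliest <= date.fromisoformat(item["publish_data"]) <= today
    ]

crawled = [
    {"title": "A", "publish_data": "2022-02-01"},
    {"title": "B", "publish_data": "2022-01-31"},
    {"title": "C", "publish_data": "2022-01-30"},
]
candidates = within_window(crawled, today=date(2022, 2, 1))
```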
NYT News API: limitation
MIND: highly sparse.
How to track user clicks:
Implicit feedback: track clicks by developing a web application that requires users to sign in to read the news (e.g., Apple News)
If we link directly to the original URL, one problem is that some NYT news is not free and requires a paid subscription
We host the news pages on AWS S3, where we are not able to track the clicks of each user
Instead:
Explicit feedback: a survey, reached through a simple link embedded in the email (this saves time)
For different types of users:
Some users only want to see news matching their interests, while others want to see all the hot news across categories. We could add a question at sign-up to ask which they prefer.
Refine the news website; currently the full text is displayed as plain text.
Try other, more advanced recommendation methods and compare the results.
Add an unsubscribe function for users.
Make the recommendation process run automatically. This is not possible yet because we use the survey provided by Mailchimp, and its API does not allow us to retrieve the survey report.
Visit our GitHub repo for more information about the project, including slides, code, data ...