The two primary sources of data for the sentiment model are the FNSPID dataset, a collection of finance-related articles from a wide variety of news sources, and The New York Times, specifically articles from its Business desk, whose article data is available through an API. Together, these two sources cover the full range from 2000 to 2024, though the year-by-year article count increases dramatically after 2005. The goal of the sentiment model is to produce a sentiment score for each business day, so the early years needed more articles for text data to be consistently available across the entire period of analysis.
While we were not able to bring the yearly counts for those 5 years close to the counts of the other years, we supplemented the dataset enough to meet the criterion of at least one article per day the market was open. We did this by scraping news sites, mostly through Google News, querying with finance-, market-, and economy-related keywords and focusing on business sections where the news agency had one. Thus, for every trading day in our timeframe, we were able to compute a sentiment score from the previous day's news to pair with that day's market returns.
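A coverage check of this kind can be sketched as follows. This is a minimal illustration and not the code used in the project: the function names are hypothetical, weekdays stand in for market open days (exchange holidays are ignored), and "previous day" is assumed to mean the previous trading day.

```python
from collections import Counter
from datetime import date, timedelta


def weekdays_between(start: date, end: date) -> list[date]:
    """Weekdays from start to end, inclusive.

    Assumption: weekdays approximate market open days; exchange
    holidays are not removed here.
    """
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days.append(current)
        current += timedelta(days=1)
    return days


def missing_days(article_dates, trading_days):
    """Trading days on which no article was published."""
    counts = Counter(article_dates)
    return [day for day in trading_days if counts[day] == 0]


def previous_trading_day(trading_days):
    """Map each trading day to the trading day before it.

    Assumption: the news from the mapped day is what feeds the
    sentiment score paired with the key day's returns.
    """
    return dict(zip(trading_days[1:], trading_days))


# Toy data: publication dates of five articles in the first week of 2000.
trading_days = weekdays_between(date(2000, 1, 3), date(2000, 1, 7))
article_dates = [
    date(2000, 1, 3),
    date(2000, 1, 3),
    date(2000, 1, 4),
    date(2000, 1, 6),
    date(2000, 1, 7),
]

gaps = missing_days(article_dates, trading_days)
news_day_for = previous_trading_day(trading_days)
print("days needing more articles:", gaps)
```

In this toy example the only gap is Wednesday, 5 January 2000, so that is the day a supplementary scrape would need to fill. The first trading day has no entry in `news_day_for`, since there is no earlier day of news to draw on.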
Tools used include: