Data Sources

Our project integrates diverse datasets to power two core machine learning components: the Deep Learning Model and the Sentiment Model. For the Deep Learning Model, we sourced structured market and financial data, including stock returns, accounting ratios, and firm characteristics, from WRDS (Wharton Research Data Services), drawing on the CRSP, Compustat, and JKP datasets. In parallel, the Sentiment Model draws on over 2.5 million financial news articles from FNSPID, The New York Times, and our own web scraping, enabling us to explore the predictive power of text-based market sentiment. Together, these complementary data pipelines provide the foundation for building a robust, regime-aware portfolio optimization framework.

Deep Learning Model Data

Sentiment Model Data

The two primary sources of data for the sentiment model are the FNSPID dataset, a collection of finance-related articles from a wide variety of news sources, and The New York Times, specifically articles from its Business desk, which are available through an API. Combined, these two sources span 2000 to 2024, though the yearly article count increases dramatically after 2005. Because the goal of the sentiment model is to produce a sentiment score for each business day, the sparse early years needed more articles for text data to be consistently available over the entire analysis period.
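The year-by-year imbalance can be checked with a simple count. The following is a minimal sketch rather than our actual pipeline: it assumes each article has already been reduced to a publication date, and the function names and the 50%-of-median cutoff are illustrative choices, not values from the project.

```python
from collections import Counter
from datetime import date
from statistics import median


def articles_per_year(article_dates):
    """Count articles by publication year."""
    return Counter(d.year for d in article_dates)


def sparse_years(counts, fraction=0.5):
    """Return years whose count is below `fraction` of the median yearly count.

    The 0.5 default is an arbitrary illustrative threshold.
    """
    cutoff = fraction * median(counts.values())
    return sorted(year for year, n in counts.items() if n < cutoff)
```

Years returned by `sparse_years` are the candidates for supplementing with additional articles.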

While we were not able to bring the yearly counts for those five early years close to the counts of the later years, we supplemented the dataset enough to meet our criterion of at least one article for every day the market was open. We did this by scraping news sites, mostly through Google News, querying with finance-, market-, and economy-related keywords and focusing on business sections where the news agency had one. As a result, for every trading day in our timeframe we can compute a sentiment score from the previous day's news to pair with that day's market returns.
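The coverage criterion and the one-day lag between news and returns can be sketched as follows. This is an illustration under stated assumptions, not our production code: weekdays stand in for days the market was open (exchange holidays are ignored), weekend and Friday articles are all assigned to the following Monday, per-article scores are simply averaged, and every function name here is hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekdays_between(start, end):
    """Weekdays (Mon-Fri) from start to end inclusive; a proxy for trading days."""
    days = []
    current = start
    while current <= end:
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            days.append(current)
        current += timedelta(days=1)
    return days


def next_weekday(day):
    """First weekday strictly after `day`."""
    nxt = day + timedelta(days=1)
    while nxt.weekday() >= 5:
        nxt += timedelta(days=1)
    return nxt


def uncovered_days(article_dates, start, end):
    """Weekdays in [start, end] on which no article was published."""
    covered = set(article_dates)
    return [d for d in weekdays_between(start, end) if d not in covered]


def scores_by_target_day(scored_articles):
    """Average (published_date, score) pairs, keyed by the next weekday.

    News published on day t is attached to the first weekday after t,
    so each key is the day whose returns the score is paired with.
    """
    buckets = defaultdict(list)
    for published, score in scored_articles:
        buckets[next_weekday(published)].append(score)
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}
```

An empty result from `uncovered_days` over the full date range corresponds to the "at least one article per open market day" criterion, up to the holiday approximation noted above.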

Tools used include:
