To explore our research question, we began by assembling a massive dataset—over 2.5 million financial news articles spanning two decades. These articles, drawn from the New York Times, the FNSPID dataset, and our own web scrapers, became the backbone of our sentiment analysis engine.
...But getting to this point didn't come easily. After merging and aligning content from multiple sources, we were left with major gaps in article volume—especially in the early 2000s, when online financial news coverage was few and far between. The heatmaps below show the contrast: in 2001, some days had only a handful of articles; by 2015, many days had hundreds.
To build a robust predictive model, we needed consistent news coverage across our entire 20-year train/validation/test period. However, early 2000s coverage was sparse—posing a significant challenge for temporal consistency in our dataset.
To close this gap, we had to create a custom web scraping tool. Using a Python script, we searched for key terms like “economy” and “finance” via Google, retrieved relevant URLs, and scraped the content of each individual article. While time-intensive, this process filled crucial holes in our dataset and improved the continuity of our sentiment signal.
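The heart of any such scraper is extracting article body text from raw HTML. The sketch below uses only Python's standard library and is illustrative, not our production script (which also handled the Google search step for terms like "economy" and "finance"); the `ParagraphExtractor` name and sample HTML are hypothetical:

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text inside <p> tags -- a rough proxy for article body text."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            # Accumulate text chunks belonging to the current paragraph.
            self.paragraphs[-1] += data


def scrape_article(html: str) -> str:
    """Return the article's paragraph text joined into one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())
```

In practice each retrieved URL's HTML would be fetched (e.g. with `urllib.request`) and passed through `scrape_article` before being added to the dataset.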
As shown in the figure below, article volume in the early years remained lower, but the fact that we had any coverage at all was a win. It allowed us to move forward with a complete and reasonably balanced dataset—critical for training our sentiment model over two decades of market history.
The end product of our data collection journey was a dataframe containing approximately 2.5 million rows of timestamped, curated article summaries. To align each article with market outcomes, we merged it with 1-day lagged S&P 500 returns, allowing us to associate today’s news with tomorrow’s market movement. We then converted these continuous returns into a binary signal—a label indicating whether the market moved up or down. This binary field (binary_return_lag) served as our target variable for model training.
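The label construction described above can be sketched in a few lines of pandas. Only `binary_return_lag` comes from our actual schema; the toy data and other column names are illustrative:

```python
import pandas as pd

# Toy stand-ins for the real article and index-return tables.
articles = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-05", "2015-01-05", "2015-01-06"]),
    "summary": ["Fed holds rates", "Earnings beat", "Oil slides"],
})
spx = pd.DataFrame({
    "date": pd.to_datetime(["2015-01-05", "2015-01-06", "2015-01-07"]),
    "daily_return": [0.002, -0.011, 0.007],
})

# Shift returns back one day so each date is paired with the NEXT day's move.
spx["return_lag"] = spx["daily_return"].shift(-1)
# Binarize: 1 if the market rose the following day, else 0. The final day has
# no next-day return and would be dropped before training.
spx["binary_return_lag"] = (spx["return_lag"] > 0).astype(int)

merged = articles.merge(
    spx[["date", "return_lag", "binary_return_lag"]], on="date"
)
```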
To process our dataset, we turned to FinBERT—a BERT-based transformer model pre-trained on a large corpus of financial text. For each article, FinBERT produced sentiment probabilities indicating the likelihood that the article was positive, neutral, or negative. As part of our Exploratory Data Analysis (EDA), we began by investigating whether these sentiment probabilities could serve as a strong predictive signal for next-day market returns.
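Under the hood, FinBERT emits one raw logit per sentiment class, and a softmax turns those into the probabilities we used as features. A minimal NumPy sketch (the logit values and label order here are illustrative; in practice both come from the pretrained FinBERT checkpoint):

```python
import numpy as np

# Illustrative label order -- the real mapping is defined by the checkpoint.
LABELS = ("positive", "negative", "neutral")


def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


# Hypothetical logits for a single article.
probs = softmax([2.1, -0.3, 0.4])
```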
As shown in the scatterplot below, we found no clear relationship between article sentiment and market movement. Whether news was classified as positive or negative, there was no consistent pattern in how the market responded the following day. This observation was confirmed by the correlation matrix below, which shows near-zero correlation between sentiment features and next-day returns—even when we applied various feature engineering techniques such as lagging, rolling averages, or multi-day smoothing.
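The engineering variants we tried are straightforward to express in pandas. The sketch below uses synthetic data, so its near-zero correlations are by construction; it only illustrates the lag and rolling-mean transforms we applied to the real sentiment features:

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for a daily sentiment feature and next-day returns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pos_prob": rng.uniform(0, 1, 250),
    "next_day_return": rng.normal(0, 0.01, 250),
})

# Engineered variants: a 1-day lag and a 5-day rolling (smoothed) average.
df["pos_prob_lag1"] = df["pos_prob"].shift(1)
df["pos_prob_roll5"] = df["pos_prob"].rolling(5).mean()

# Correlation of every feature variant against next-day returns.
corr = df.corr()["next_day_return"]
```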
These results highlighted a core challenge: financial markets are noisy, nonlinear, and often influenced by factors beyond textual sentiment... Oh, and not to mention they're often manipulated by big players who might coordinate with market makers behind the scenes to drive the price down on bullish sentiment, and vice versa...
...anyway, I digress.
The weak signal-to-noise ratio made one thing clear: simple linear models weren’t going to cut it. If we wanted to extract anything meaningful from this mess, we had to dig deeper, enrich our features, and embrace more sophisticated modeling strategies in the next phase.
In addition to leveraging FinBERT's sentiment probabilities, we extracted a dense vector embedding from each article’s CLS token—a 768-dimensional representation designed to capture the semantic meaning of the entire article in a format our models could learn from.
However, we quickly ran into another practical issue: the number of articles published per day varied widely. Some days saw hundreds of headlines while others only had a few. To ensure we'd end up with a single vector representation per day, we tested two aggregation techniques to combine article-level embeddings into a single daily “mood vector”:
Mean pooling: Averaged all embeddings equally, regardless of content.
Attention pooling: Weighted each embedding by its significance, giving more weight to impactful articles based on how strongly their sentiment diverged from neutral (i.e., the magnitude of positive or negative signal).
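Both strategies can be sketched in a few lines of NumPy. The signed sentiment scores passed to `attention_pool` are an illustrative stand-in for the divergence-from-neutral weights described above:

```python
import numpy as np


def mean_pool(embeddings):
    """Average all article embeddings equally, regardless of content."""
    return np.asarray(embeddings, dtype=float).mean(axis=0)


def attention_pool(embeddings, sentiment_scores):
    """Weight each embedding by how far its sentiment diverges from neutral.

    `sentiment_scores` are illustrative signed values where 0 is neutral;
    the weights are their magnitudes, normalized to sum to 1.
    """
    emb = np.asarray(embeddings, dtype=float)
    w = np.abs(np.asarray(sentiment_scores, dtype=float))
    # Fall back to uniform weights if every article scored exactly neutral.
    w = w / w.sum() if w.sum() > 0 else np.full(len(emb), 1.0 / len(emb))
    return w @ emb
```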
This aggregation step allowed us to distill complex, high-dimensional text data into a single structured input per day—a crucial step that enabled downstream predictive modeling.
With these daily sentiment vectors in hand, we moved into model development. Given the complexity of our dataset, we explored a range of non-linear classification models including Random Forests, Neural Networks (MLPs), and ensemble strategies that combined multiple architectures. Among them, one model consistently stood out: the Attention-Pooled Fine-Tuned Neural Network (shown as the GOLD line in the backtesting plot below). In this approach, we fine-tuned the final few layers of FinBERT and trained the model to classify whether the S&P 500 would rise or fall the next day.
While our best model achieved only 56% accuracy, it excelled in what matters most in finance: risk-adjusted returns. It achieved the highest Sharpe ratio (0.44), suffered the smallest maximum drawdown (-34%), and demonstrated the most stable backtested performance across a variety of market regimes.
To evaluate our models, we used a simple trading simulation: starting with $1,000 in capital, we executed trades based on each model’s prediction. If the model predicted an “up” day (label = 1), we entered the market with all our capital and bought the S&P 500. If the model predicted a “down” day (label = 0), we exited the market and held cash. This strategy allowed us to directly measure the impact of prediction accuracy on portfolio growth over time.
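The simulation is easy to reproduce. Here is a minimal NumPy sketch of the all-in/all-out strategy, along with the Sharpe ratio and maximum drawdown metrics reported above (the 252-trading-day annualization and zero risk-free rate are assumptions):

```python
import numpy as np


def backtest(predictions, daily_returns, capital=1000.0):
    """Go long the index on predicted 'up' days (1), hold cash on 'down' days (0)."""
    preds = np.asarray(predictions)
    rets = np.asarray(daily_returns, dtype=float)

    # Earn the market return only on days the model says to be invested.
    strategy_rets = np.where(preds == 1, rets, 0.0)
    equity = capital * np.cumprod(1.0 + strategy_rets)

    # Annualized Sharpe ratio (assumes 252 trading days, zero risk-free rate).
    std = strategy_rets.std()
    sharpe = (strategy_rets.mean() / std) * np.sqrt(252) if std > 0 else 0.0

    # Maximum drawdown: worst peak-to-trough decline of the equity curve.
    peak = np.maximum.accumulate(equity)
    max_drawdown = ((equity - peak) / peak).min()

    return equity, sharpe, max_drawdown
```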
The GREEN line in the chart represents a standard buy-and-hold strategy, included as a baseline for comparison. Notably, our best-performing model outpaced the buy-and-hold baseline.
Interestingly, while the ensemble model achieved the highest classification accuracy (62%), it underperformed in terms of financial returns. It struggled during regime shifts, highlighting a critical insight: in financial modeling, accuracy alone isn’t enough. True value lies in stability, adaptability, and the ability to adjust during changing market conditions.
The final output of our sentiment model is a probability score for each day: a continuous value representing the probability that the market will rise the following day. This score is fed directly into our larger Deep Learning Model as a feature to support market timing decisions, providing our overall framework with a subtle yet powerful signal drawn from financial news data.