Data & Model

Data 1: Daily Open & Close Prices from Yahoo Finance

We selected two industries for stock price prediction: cryptocurrency and energy. Cryptocurrency is a recently emerging industry, so we only used price data from 2016 to 2021. The energy industry has a much longer history, which let us go back to 2009 (approximately the earliest date for which we were able to fetch news data). For each industry we chose three representative tickers: BTC-USD, MARA, and RIOT for cryptocurrency, and COG, DVN, and HFC for energy. The market open and close prices were obtained from Yahoo Finance using the yfinance library.
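As a rough illustration of the price download, the sketch below fetches daily open and close prices per ticker with yfinance. The exact start and end dates, the dictionary names, and the `fetch_open_close` helper are assumptions made for this example, not the project's actual code. Some of these symbols may no longer resolve on Yahoo Finance if they have been renamed since the data was collected.

```python
# Tickers named in the text, grouped by industry.
TICKERS = {
    "cryptocurrency": ["BTC-USD", "MARA", "RIOT"],
    "energy": ["COG", "DVN", "HFC"],
}

# Assumption: the text only gives years, so the exact day boundaries
# (and the end year for energy) are placeholders for illustration.
DATE_RANGES = {
    "cryptocurrency": ("2016-01-01", "2021-12-31"),
    "energy": ("2009-01-01", "2021-12-31"),
}


def fetch_open_close(industry):
    """Return {ticker: DataFrame with Open and Close columns} for one industry."""
    # Imported here so the configuration above can be used without yfinance installed.
    import yfinance as yf

    start, end = DATE_RANGES[industry]
    frames = {}
    for ticker in TICKERS[industry]:
        # Ticker.history returns a DataFrame indexed by date.
        history = yf.Ticker(ticker).history(start=start, end=end)
        frames[ticker] = history[["Open", "Close"]]
    return frames
```

Calling `fetch_open_close("energy")` requires network access and returns one table per ticker, which can then be joined with the daily news sentiment.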

Data 2: News Titles + Descriptions from Google News

We believe that stock price fluctuations may influence news sentiment, or vice versa. Therefore, we also use news titles and descriptions from Google News for price prediction. Using an existing script for scraping Google News, we obtained ~26k news items containing the keyword bitcoin and ~73k news items containing the keywords oil, gas, energy.

Sentiment Analysis Software: VADER

VADER (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon and rule-based sentiment analysis model specifically attuned to sentiments expressed in social media (Hutto & Gilbert, 2014). For a given text, the model computes the proportions of the text that fall into the positive, negative, and neutral categories. It also computes a normalized, weighted composite score: a score >= 0.05 indicates positive sentiment, a score <= -0.05 indicates negative sentiment, and a score between -0.05 and 0.05 indicates neutral sentiment. Additional details of the model can be found in this paper and this GitHub repo. The figure on the left shows the composite scores for 3 news titles: "Thanks, bitcoin! Traders say goodbye to quiet weekends", "Should You (or Anyone) Buy Bitcoin", and "Coinbase hangover? Here's why bitcoin may be suffering its steepest slide since February". The model assigns these titles a positive, neutral, and negative sentiment, respectively.
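The thresholds above can be sketched in a few lines. The example assumes the `vaderSentiment` Python package; the helper names `label_from_compound` and `score_title` are made up for this illustration and are not part of VADER itself.

```python
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def label_from_compound(compound):
    """Map a VADER composite (compound) score to a sentiment label."""
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


def score_title(title):
    """Return (compound score, label) for one news title.

    Assumes the vaderSentiment package is installed.
    """
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns a dict with keys 'neg', 'neu', 'pos', 'compound'.
    scores = analyzer.polarity_scores(title)
    return scores["compound"], label_from_compound(scores["compound"])
```

For example, `score_title("Should You (or Anyone) Buy Bitcoin")` returns the composite score for that title together with the label implied by the thresholds.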

Data Preprocessing Pipeline

With both the open & close prices and the news sentiments, our goal is to use the first 11 days of data to predict the close price on the 11th day and the open & close prices for the following 4 days.
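One possible reading of this setup is sketched below as a sliding window over 15 consecutive days. It assumes that the inputs are the full open, close, and sentiment values for days 1 to 10 plus the open price and sentiment of day 11, and that the 9 targets are the day-11 close followed by the open and close of days 12 to 15. That split, and the `make_windows` helper, are assumptions for illustration; the actual pipeline may arrange the features differently.

```python
def make_windows(opens, closes, sentiments, n_input_days=11, n_future_days=4):
    """Slice aligned daily series into (full_days, day11_partial, targets) samples."""
    if not (len(opens) == len(closes) == len(sentiments)):
        raise ValueError("all series must cover the same days")

    span = n_input_days + n_future_days  # 15 days per sample by default
    windows = []
    for start in range(len(opens) - span + 1):
        day11 = start + n_input_days - 1  # index of the last input day

        # Days 1..10: everything is known.
        full_days = [
            (opens[d], closes[d], sentiments[d]) for d in range(start, day11)
        ]
        # Day 11: the close is withheld because it is the first target.
        day11_partial = (opens[day11], sentiments[day11])

        # Targets: day-11 close, then open and close of the next 4 days.
        targets = [closes[day11]]
        for d in range(day11 + 1, start + span):
            targets.extend([opens[d], closes[d]])

        windows.append((full_days, day11_partial, targets))
    return windows
```

With the default settings each sample carries 1 + 2 * 4 = 9 target values, and a series of N days yields N - 14 samples.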

Long short-term memory model (LSTM)

Here we adopt the LSTM structure of Li et al. (2019). Their approach predicts stock prices without any big compute or big data techniques, so it has a long execution time, which we want to improve. The LSTM structure gives good prediction results, as shown in the figure on the bottom right, where the trends of the predicted and true stock prices match well. We believe that parallel data processing and parallel computing would make stock price prediction faster and easier to use.

To test the scalability of the models, we built and trained models on data sets from different industries and time ranges, and then used parallel model training with GPUs and the cuDNN toolkit on Harvard Cannon to improve run times.
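A minimal sketch of a GPU-friendly LSTM regressor in Keras is shown below. It is not the architecture of Li et al. (2019): the single LSTM layer, the number of units, the optimizer, the loss, and the helper names `n_outputs` and `build_lstm` are all assumptions chosen for illustration. The Keras LSTM layer is left at its default arguments because that is the configuration in which TensorFlow can use the cuDNN kernel when a GPU is available.

```python
def n_outputs(n_future_days=4):
    """Targets per sample: one same-day close plus open and close per future day."""
    return 1 + 2 * n_future_days


def build_lstm(n_timesteps, n_features, n_targets, units=64):
    """Build and compile a small LSTM regressor (illustrative sizes only)."""
    # Imported here so the helper above works without TensorFlow installed.
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_timesteps, n_features)),
        # Default activation settings keep the layer eligible for the cuDNN kernel.
        tf.keras.layers.LSTM(units),
        tf.keras.layers.Dense(n_targets),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model


def gpu_available():
    """Report whether TensorFlow can see at least one GPU."""
    import tensorflow as tf

    return len(tf.config.list_physical_devices("GPU")) > 0
```

A model for the prediction task described above could be created with `build_lstm(n_timesteps, n_features, n_outputs())`, where the input dimensions depend on how the daily features are arranged.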
