Lihong Zhang, Mihuan Li, Yichen Geng, Tianlei He
Many people use machine learning models, especially time series models, to predict stock prices. However, most existing models have long execution times for the following reasons:
They process massive amounts of input data.
They require long training times, because time series models consist of serial computations.
Users must therefore wait a long time for prediction results, which makes it hard to build user-friendly applications for people who do not know machine learning but still want to get predictions from machine learning models quickly.
We can solve these problems with big compute and big data solutions, as follows:
Big data: parallel data processing with Spark and Hadoop
Big compute: parallel training of multiple models on GPUs
On this website, we present our project in four sections: Data & Model, Performance & Overheads, Software, and Discussion.
In Data & Model, we discuss the two data sources and the VADER function for sentiment analysis. We predict future stock prices with LSTM models (from Li et al. 2019) based on previous stock prices and news sentiment. The input data has two main parts:
Data 1: Daily Open & Close Prices from Yahoo Finance
Data 2: News Titles + Descriptions from Google News
We use the Valence Aware Dictionary and Sentiment Reasoner (VADER) for sentiment analysis. Details of the data and model are given in the Data & Model section below.
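As a quick illustration of what VADER produces, the sketch below scores a news headline; it assumes the vaderSentiment package (one common VADER implementation), which may differ from the exact library used in our code:

```python
# Minimal sketch of VADER sentiment scoring, assuming the
# vaderSentiment package (pip install vaderSentiment).
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
headline = "Tech stocks rally as earnings beat expectations"
scores = analyzer.polarity_scores(headline)
# scores is a dict like {'neg': 0.0, 'neu': 0.58, 'pos': 0.42, 'compound': 0.64};
# the 'compound' value in [-1, 1] summarizes the overall sentiment.
print(scores["compound"])
```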
In the Performance & Overheads section, we discuss the speedup from our parallel data processing and parallel model training. In the Software section, we present ready-to-use software for stock price prediction. Finally, in the Discussion section, we review our achievements and future improvements.
Our code and documentation can be accessed on GitHub. Below is an overview of our workflow.
In the first step, we obtained raw Google News data and Yahoo Finance historical market data with get_news() and get_stock_price() in fetch_data.py.
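The internals of get_stock_price() are not reproduced here; the following is a hypothetical sketch assuming the yfinance package, which pulls daily prices from Yahoo Finance as a pandas DataFrame. The ticker and date range are illustrative:

```python
# Hypothetical sketch of get_stock_price(), assuming the yfinance
# package; the actual implementation in fetch_data.py may differ.
import yfinance as yf

def get_stock_price(ticker, start, end):
    """Download daily market data from Yahoo Finance and keep Open/Close."""
    df = yf.download(ticker, start=start, end=end, progress=False)
    return df[["Open", "Close"]]

prices = get_stock_price("AAPL", start="2019-01-01", end="2019-12-31")
print(prices.head())
```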
Then we implemented parallel data processing in general_preprocess.py with Spark, ran it on AWS, and obtained the processed data.
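We do not reproduce general_preprocess.py here; the sketch below only illustrates the general idea of scoring news sentiment in parallel with a PySpark UDF. The file name and column names are assumptions, not the exact code:

```python
# Illustrative sketch of parallel sentiment scoring with Spark;
# file and column names are assumptions, not the code in
# general_preprocess.py.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

spark = SparkSession.builder.appName("news-preprocess").getOrCreate()

def compound_score(text):
    # Each row is scored independently, so Spark can distribute the
    # work across partitions; a per-row analyzer keeps the UDF simple.
    return float(SentimentIntensityAnalyzer().polarity_scores(text or "")["compound"])

sentiment_udf = udf(compound_score, FloatType())

news = spark.read.csv("news.csv", header=True)  # assumed columns: date, title, description
scored = news.withColumn("sentiment", sentiment_udf(news["title"]))
scored.write.csv("news_scored", header=True, mode="overwrite")
```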
Finally, we fed the processed data into LSTM models and trained them with GPUs and cuDNN on Harvard's Cannon cluster using the SLURM job manager.
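The exact architecture follows Li et al. (2019); as a rough sketch only, a Keras LSTM over sliding windows of price and sentiment features might look like the following. The layer size, window length, and training settings are illustrative assumptions, and on a GPU, tf.keras.layers.LSTM dispatches to the cuDNN kernel automatically when its default settings are used:

```python
# Illustrative LSTM sketch in Keras; layer sizes, window length, and
# training settings are assumptions, not the exact model from
# Li et al. (2019).
import numpy as np
import tensorflow as tf

WINDOW = 20   # days of history per training sample (assumed)
FEATURES = 3  # open price, close price, news sentiment

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW, FEATURES)),
    tf.keras.layers.LSTM(64),      # uses the cuDNN kernel on GPU by default
    tf.keras.layers.Dense(1),      # next-day close price
])
model.compile(optimizer="adam", loss="mse")

# Dummy arrays standing in for the processed Spark output.
X = np.random.rand(256, WINDOW, FEATURES).astype("float32")
y = np.random.rand(256, 1).astype("float32")
model.fit(X, y, epochs=5, batch_size=32)
```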