Stock Market Volatility Meter

Exploring the practicality of using sentiment analysis techniques to predict stock market volatility

Abstract

The goal of our project is to predict the volatility of a company’s stock price from news articles using sentiment analysis. Our approach is to collect as much data as possible efficiently and to apply NLP models to the collected data to predict how volatile the stock price of a given company will be in the future.

Introduction

In this project, we use NLP models to predict the volatility of different companies’ stocks and analyze the results produced by different models. We trained our models on 3 different companies (results for Tesla and Pfizer are presented in this report, but we also experimented on Apple data), all of which exhibited varying degrees of volatility and unpredictability. The question we seek to address through NLP models is whether we can successfully predict the volatility of a company’s stock using news articles about the company. Though our model is trained on articles whose content revolves specifically around financial data, a natural follow-up question is whether we can eventually develop a model that achieves high accuracy on news articles from any platform.

Data Collection

We use BeautifulSoup4 to extract the HTML content under the article tag of each article’s webpage. We then discard data inside the HTML that we consider irrelevant, such as the author’s email (common in MarketWatch articles), which is prefixed with “Email: ”, links to other articles prefixed with the label “Read More:”, and other hyperlinks. Because these sections are consistently prefixed with their labels, we can skip them while scraping.
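A minimal sketch of this scraping-and-filtering step (the function name and the toy HTML are illustrative assumptions, not our production scraper):

```python
from bs4 import BeautifulSoup

def extract_article_text(html):
    """Pull text from the <article> tag, skipping paragraphs we treat as
    irrelevant: author emails ("Email: ...") and cross-links ("Read More:")."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    if article is None:
        return ""
    kept = []
    for p in article.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Skip sections identified by their label prefixes.
        if text.startswith(("Email:", "Read More:")):
            continue
        kept.append(text)
    return " ".join(kept)

html = ('<html><body><article>'
        '<p>Pfizer rose 3% today.</p>'
        '<p>Email: reporter@example.com</p>'
        '<p>Read More: another story</p>'
        '</article></body></html>')
text = extract_article_text(html)
```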

Stock price data is obtained from Yahoo Finance.

Automated Labeling

To label the articles about a company, we used that company’s stock price data. A 7-day moving standard deviation was calculated and smoothed with a 3-day moving average to ensure consistency of labels.

The red line in the graph ("Pfizer stock") was used to determine whether the stock was going through a period of volatility. A threshold of y = 1 was used: all articles published while the smoothed standard deviation was above this threshold were labeled ‘1’, indicating volatility.
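The labeling rule above can be sketched with pandas (the window sizes and the threshold of 1 follow the text; the synthetic price series and helper name are illustrative):

```python
import pandas as pd

def label_volatility(close, std_window=7, smooth_window=3, threshold=1.0):
    """Return a 0/1 series: 1 where the smoothed rolling standard deviation
    of the closing price exceeds the threshold (a volatile period)."""
    rolling_std = close.rolling(std_window).std()         # 7-day moving std
    smoothed = rolling_std.rolling(smooth_window).mean()  # 3-day moving average
    return (smoothed > threshold).astype(int)

# Synthetic closing prices: a calm stretch followed by a jumpy one.
close = pd.Series([100.0] * 15 +
                  [100, 110, 95, 112, 90, 115, 88, 118, 85, 120], dtype=float)
labels = label_volatility(close)
```

Articles would then inherit the label of the day on which they were published.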

Method

Preprocessing

Many articles contained irrelevant information. For example, some articles discussed several companies in the pharmaceutical industry, and information about Pfizer appeared in only a single paragraph. This was a challenge because data about the performance of other companies would contaminate the training data for Pfizer. To solve this problem, we added a preprocessing step that keeps only the 30 words after each mention of ‘Pfizer’.
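A sketch of this windowing step (the helper name and its whitespace-based handling of punctuation are assumptions):

```python
def company_windows(text, company="Pfizer", window=30):
    """Return the `window` words following each mention of the company name."""
    words = text.split()
    blocks = []
    for i, w in enumerate(words):
        # Case-insensitive substring match so "Pfizer's" etc. also count.
        if company.lower() in w.lower():
            blocks.append(" ".join(words[i + 1 : i + 1 + window]))
    return blocks

text = "Moderna rallied today. Pfizer " + " ".join(f"w{i}" for i in range(40))
blocks = company_windows(text)
```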

Tokenization:

For tokenizing the preprocessed text, we tried two different approaches:

  1. Bag-of-Words model: We used Scikit-learn’s CountVectorizer to create the BoW representation and the resultant sparse matrix of the vocabulary was used to train various linear models. The 30-word text blocks obtained from the preprocessing step were split into a training and a testing set and the training set was used to fit the CountVectorizer as well as the linear models.

  2. BERT sentence embeddings: We used a pre-trained BERT model to extract features from the text in our dataset. The second-to-last hidden layer of the BERT model was used to obtain a 768-dimensional embedding vector for each word. The word embeddings for each row of our preprocessed data were combined into a sentence embedding vector for that row by taking their arithmetic mean. These vector representations of each row in our training dataset were then used to train a logistic regression model and a 3-layer neural network.
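The bag-of-words pipeline from approach 1 can be sketched as follows (the toy texts and labels are invented for illustration; they are not our dataset):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy 30-word-style text blocks with made-up volatility labels (1 = volatile).
texts = ["pfizer shares surged after earnings",
         "pfizer stock was flat all week",
         "vaccine approval sent pfizer soaring",
         "quiet trading day for pfizer"]
labels = [1, 0, 1, 0]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # sparse document-term matrix
clf = LogisticRegression().fit(X, labels)

# Unseen text must be transformed with the already-fitted vectorizer.
pred = clf.predict(vectorizer.transform(["pfizer shares surged on approval"]))
```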
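The pooling step from approach 2 is simply an arithmetic mean over per-token vectors. Obtaining real BERT hidden states requires the HuggingFace `transformers` library and a model download, so this sketch uses placeholder arrays (the helper name is an assumption):

```python
import numpy as np

def sentence_embedding(token_embeddings):
    """Mean-pool per-token embeddings (n_tokens x 768, e.g. taken from
    BERT's second-to-last hidden layer) into one 768-d sentence vector."""
    return np.asarray(token_embeddings).mean(axis=0)

# With transformers (not run here), the token vectors could come from
# roughly:  outputs = model(**tokens, output_hidden_states=True)
#           token_embeddings = outputs.hidden_states[-2][0]
toks = np.full((5, 768), 2.0)   # placeholder for 5 token embeddings
vec = sentence_embedding(toks)
```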



Classification:

  1. Logistic regression:

We trained a logistic regression model using the sentence embeddings from BERT.

  2. Neural network for classification:

We also trained a neural network on the BERT sentence embeddings. We trained a multilayer feedforward neural network with three layers of sizes 32, 16, and 2.
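A sketch of such a network using scikit-learn’s MLPClassifier on random stand-in embeddings (with MLPClassifier the final 2-class output layer is implicit in the binary labels; the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))      # stand-in for BERT sentence embeddings
y = (X[:, 0] > 0).astype(int)        # synthetic volatile / not-volatile labels

# Hidden layers of sizes 32 and 16; the 2-class output layer is implicit.
clf = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)                # training accuracy on the toy data
```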

Cross-Validation:

We performed K-Fold cross-validation to evaluate our best-performing model (logistic regression). We validated its performance over 5 folds and report the average scores in the results section below. For evaluating the neural network, we split the dataset into three sets: train, test, and validation.
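The 5-fold evaluation of the logistic regression model can be sketched with scikit-learn’s cross_val_score (the data here is synthetic, so the scores below are not our reported results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 768))      # stand-in for BERT sentence embeddings
y = (X[:, 0] > 0).astype(int)        # synthetic volatility labels

# 5-fold cross-validation; scores holds one accuracy per fold.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_acc = scores.mean()
```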

Results

Please email me at suraj.s.pathak[at]gmail[dot]com if you are interested in the results of our experiments.

Future Work

Since our dataset is collected programmatically, we have efficient, fine-grained control over the training data. With the Python scripts we wrote, we can easily collect MarketWatch/Yahoo Finance data for different companies or increase the amount of data collected for existing companies. Furthermore, news data from other reliable sources could be used for training alongside the MarketWatch data. Another way to improve the reliability of the model would be to add data about more companies, preferably from different industries. For this project, we selected only U.S. companies; however, adding data from companies in other countries could help create a more robust classifier.

News data contains critical information about how a company is performing, and traders rely heavily on it when making trading decisions. For this reason, this model could be useful both for traders and for algorithmic trading: it can process real-time data scraped from multiple sources simultaneously and make projections about how stock prices will vary.