The simplest type of sentiment analysis is dictionary-based sentiment analysis, which assigns each word a sentiment score based on modern-day usage and definitions. A text's overall sentiment score is computed by averaging its words' sentiment scores. A major drawback of dictionary-based sentiment analysis is that it does not take context into account. For example, the sentence "this family is not great" would be rated as positive because "family" and "great" carry positive sentiment scores, while the negation is ignored. Another drawback is that the dictionaries are based on modern usage and definitions, so their sentiment scores might not be accurate for our time period. We tried performing sentiment analysis with a few different dictionaries: Bing, NRC, and VADER. VADER and Bing seemed relatively accurate, but NRC seemed skewed positive. Both Bing and NRC are available in R, while VADER is in Python. While the sentiment dictionaries' results were promising, we wanted to try our hand at machine learning. The graph below compares the sentiment dictionaries Bing and NRC.
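As an illustration of how the averaging works and why negation slips through, here is a toy sketch with made-up scores (not the actual Bing, NRC, or VADER values):

```python
# Toy dictionary-based scoring: illustrative scores only, not a real lexicon.
toy_lexicon = {"family": 0.4, "great": 0.8, "murder": -0.9, "plague": -0.7}

def dictionary_sentiment(text):
    words = text.lower().split()
    scores = [toy_lexicon[w] for w in words if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Negation is ignored, so this sentence still comes out positive.
print(dictionary_sentiment("this family is not great"))  # 0.6
```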
Because of our dataset, we knew that we could not use supervised machine learning. There were no sentiment labels for our texts, and we were hesitant to use human sentiment labelling services like Amazon Mechanical Turk because our texts are written in older English. We first tried a combination of VADER and BERT. BERT is a pre-trained neural network built for NLP tasks. We used VADER to create the training and validation sets, so that BERT would learn word sentiments from VADER while adding context. We also tried using a pre-trained BERT sentiment model to generate our training and validation sets. Unfortunately, both approaches proved inaccurate. As an example of how badly the BERT sentiment analysis went, a text describing a woman who murdered her husband and children by slitting their throats was marked positive. We came to realize that because BERT relies on context and sentence structure, it is not compatible with the Bag-of-Words (BOW) representation we were using; BERT needs sentence structure to inform its model. You can see code for this in our GitHub.
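For readers curious what the VADER-labelling step might look like, here is a rough sketch of generating sentiment labels that a BERT fine-tuning run could train on. It assumes the vaderSentiment Python package, and the ±0.05 cutoffs are VADER's conventional thresholds rather than values from our pipeline; our actual code is in the GitHub repository.

```python
# Rough sketch: label sentences with VADER to build a training set for BERT.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def vader_label(sentence):
    # VADER's compound score runs from -1 (most negative) to +1 (most positive).
    compound = analyzer.polarity_scores(sentence)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

sentences = ["The harvest was plentiful and the town rejoiced.",
             "The plague took many souls that winter."]
labeled = [(s, vader_label(s)) for s in sentences]
```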
With BERT out of the picture, we turned our attention to Word2Vec and used it to try clustering. While Word2Vec also uses context to create embeddings, it is not dependent on sentence structure. The clustering method creates embeddings for all the words and, based on those embeddings, tries to form positive and negative sentiment clusters. A weakness of Word2Vec soon came to light with clustering: if a word is used in two different contexts, Word2Vec, unlike BERT, cannot detect the difference. For example, if one sentence talked about apple the fruit and another talked about Apple the company, Word2Vec would treat the two as the same word with the same meaning and create a single embedding for both. As a result, as we fed more and more texts into our Word2Vec code, the clustering results became more and more inaccurate. Words that should have been positive ended up in the negative cluster, and vice versa. We suspect that because Word2Vec does not capture enough context, different word sentiments get pushed together, causing words to see-saw back and forth between the two clusters.
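A minimal sketch of the clustering idea, assuming gensim and scikit-learn (the two-sentence corpus and the parameters below are placeholders, not our actual setup):

```python
# Train Word2Vec embeddings, then split the vocabulary into two clusters
# hoped to correspond to positive and negative sentiment.
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

tokenized_texts = [["gold", "was", "precious", "and", "prized"],
                   ["the", "plague", "brought", "misery", "and", "death"]]

model = Word2Vec(sentences=tokenized_texts, vector_size=100, window=5,
                 min_count=1, workers=4)

# Each word gets exactly one embedding, so "apple" the fruit and "Apple"
# the company (once lower-cased) would collapse into a single vector.
words = list(model.wv.index_to_key)
vectors = model.wv[words]
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
word_to_cluster = dict(zip(words, clusters))
```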
Once we realized that the above methods would not work for the purposes of our project, we pivoted to fine-tuning the VADER sentiment dictionary to analyze windows of text surrounding specific consumption words that we chose. The words we went with are gold, silver, wool, beer, and tobacco, and we took five words on either side of each target consumption word to make up a ten-word context "window" surrounding each keyword. That way, if a text is overall negative but has a positive sentence about gold, the context window only pulls in the positive sentence about gold, giving us a more accurate understanding of the sentiments surrounding the words we are interested in. We also discovered that the VADER sentiment dictionary is customizable, so we updated it with words that are relevant to our lexicon and whose standard VADER scores would likely differ from ours, since the VADER dictionary is better suited to modern contexts of these words. Here's the list of words and their sentiment scores that we updated:
After updating the dictionary, we ran sentiment analysis on these context windows and were able to see the sentiments surrounding the different consumption items, which ranged from the most negative sentiment of -1.0 to the most positive sentiment of +1.0. Using the philosophical : religion ratio that we computed with topic modeling, we separated the sentiments into two groups: all religious, and religious + philosophy.
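To make the window-and-score step concrete, here is a minimal sketch assuming the vaderSentiment package. The lexicon values added below and the example sentence are illustrative only, not the scores from our updated list, and we gloss over whether the keyword itself counts toward the ten-word window.

```python
# Sketch of the context-window approach with a customized VADER lexicon.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
analyzer.lexicon.update({"gold": 1.5, "tobacco": -0.5})  # example values only

KEYWORDS = {"gold", "silver", "wool", "beer", "tobacco"}

def context_windows(tokens, half_width=5):
    """Yield (keyword, window) pairs: the keyword plus five words on each side."""
    for i, token in enumerate(tokens):
        if token.lower() in KEYWORDS:
            window = tokens[max(0, i - half_width): i + half_width + 1]
            yield token.lower(), " ".join(window)

tokens = "the merchant praised the fine gold he had carried home from afar".split()
for keyword, window in context_windows(tokens):
    # The compound score ranges from -1.0 (most negative) to +1.0 (most positive).
    print(keyword, analyzer.polarity_scores(window)["compound"])
```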
For next steps, we would love to go deeper into our period: examine more consumption items, analyze how sentiments shift over time, and so on. Additionally, machine learning, whether supervised or unsupervised, could potentially provide a more accurate picture of sentiments. As mentioned above, during the ten weeks of Data+ we tried implementing unsupervised machine learning using BERT. Unfortunately, the results were not accurate, which we attributed to the fact that the Bag-of-Words model does not work with BERT. It became obvious to us that, as our research becomes more nuanced, the Bag-of-Words model will no longer make sense to work with.