Word Embedding and Cosine Similarity

Heatmaps for 1580-1584, 1600-1604, 1625-1629 embeddings, respectively

Link to Our GitHub

Purpose of Word Embedding

Our goal was to use word embeddings to track the associations between words in our dataset. Specifically, we hoped to examine whether we could find a meaningful distinction between religious and philosophical ethics. Because these two ethical frameworks were closely interconnected, especially during this time period, separating religion from philosophy was not a trivial matter. We also wanted to determine how religious and philosophical terms related to certain consumption items, which would allow us to estimate how these items were viewed under different ethical mindsets.

Word Embedding

To achieve this goal, we made use of word embedding algorithms. Word embedding algorithms are natural language processing tools that map each word in a corpus to a vector in a mathematical vector space. Although the mechanics of this process depend on the algorithm used, the end result is a powerful way to analyze large bodies of text via simple vector algebra. In this project, we used the Word2Vec Continuous Bag of Words (CBOW) model to generate our word embeddings. We divided our corpus into subsets by five-year period (e.g. 1580-1584) and saved the word embeddings for each period separately.
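The period split described above could be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the function names (`period_label`, `group_by_period`, `train_period_embeddings`), the toy documents, the assumption that periods are aligned to 1580, and the hyperparameters (`vector_size`, `window`, `min_count`) are all hypothetical. The training step assumes the gensim library, where `sg=0` selects CBOW; it is imported lazily so the grouping code runs without it.

```python
from collections import defaultdict


def period_label(year, start=1580, width=5):
    """Return the five-year period containing `year`, e.g. 1582 -> '1580-1584'.

    Assumes periods are aligned to `start` (hypothetical default: 1580).
    """
    offset = (year - start) // width
    first = start + offset * width
    return f"{first}-{first + width - 1}"


def group_by_period(documents, start=1580, width=5):
    """Group (year, tokens) pairs into {period label: list of token lists}."""
    groups = defaultdict(list)
    for year, tokens in documents:
        groups[period_label(year, start, width)].append(tokens)
    return dict(groups)


def train_period_embeddings(groups, vector_size=100, window=5, min_count=1):
    """Train one CBOW model per period. Requires gensim (imported lazily)."""
    from gensim.models import Word2Vec

    embeddings = {}
    for label, sentences in groups.items():
        # sg=0 selects the Continuous Bag of Words architecture.
        model = Word2Vec(sentences=sentences, vector_size=vector_size,
                         window=window, min_count=min_count, sg=0)
        # model.wv holds the word vectors; it can be stored with .save(path).
        embeddings[label] = model.wv
    return embeddings


# Toy documents, invented purely for illustration.
docs = [
    (1581, ["god", "is", "holy"]),
    (1584, ["plato", "on", "ethics"]),
    (1602, ["tobacco", "and", "habit"]),
]
groups = group_by_period(docs)
```

Calling `train_period_embeddings(groups)` would then yield one set of word vectors per five-year period, which is what allows similarities to be compared across time.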

Embeddings of Random Words

Cosine Similarity

Our main tool for tracking these relationships was cosine similarity. Working with the vectors generated by Word2Vec, we can calculate the cosine of the angle between any two vectors using the dot product. The cosine value is a measurement of how related (i.e. how similar) two words are.

To take advantage of this technique, we devised the following strategy. First, we chose a selection of religious and philosophical terms based on our readings. We then computed the cosine similarity between each pair of words and plotted the results as a heatmap. In the heatmap, we observed that certain groups of words formed clusters. In particular, religious words formed one cluster, while philosophical words formed another. From this, we concluded that there is a meaningful distinction between religious and philosophical ideologies.

From here, we split our selection of terms into two sets based on which type of ethics they represented. For religious terms, we ended up with the following: Christ, heaven, holy, God, and faithful. For philosophical terms, we used: community, habit, ethics, Plato, and Aristotle. Then, we took our five consumption items of interest and computed the similarity score between each item and each word in both sets. Finally, we averaged the similarity scores by ethical category. We repeated this process for each five-year period and then plotted the average scores for each category over time.
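The calculations described above can be sketched in a few lines. This is an illustrative sketch rather than our actual code: the function names, the placeholder `"item"` standing in for a consumption item, and the two-dimensional toy vectors are all invented for the example (real Word2Vec vectors have many more dimensions). Cosine similarity is computed as the dot product of two vectors divided by the product of their lengths, and the vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors, words):
    """Pairwise cosine similarities, in the row/column order of `words`.

    This is the grid of values that a heatmap would display.
    """
    return [[cosine_similarity(vectors[a], vectors[b]) for b in words]
            for a in words]


def average_similarity(vectors, item, terms):
    """Mean cosine similarity between `item` and each term in `terms`."""
    scores = [cosine_similarity(vectors[item], vectors[t]) for t in terms]
    return sum(scores) / len(scores)


# Toy 2-D vectors, invented purely for illustration.
vectors = {
    "god": [1.0, 0.0],
    "holy": [0.8, 0.6],
    "plato": [0.0, 1.0],
    "ethics": [-0.6, 0.8],
    "item": [0.6, 0.8],  # hypothetical stand-in for a consumption item
}
religious = ["god", "holy"]
philosophical = ["plato", "ethics"]

matrix = similarity_matrix(vectors, religious + philosophical)
religious_score = average_similarity(vectors, "item", religious)
philosophical_score = average_similarity(vectors, "item", philosophical)
```

Repeating the last two lines with the vectors from each five-year period gives one pair of averages per period, which is the series plotted over time.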

Heatmaps for 1580-1584, 1600-1604, 1625-1629 embeddings, respectively

Cosine Similarity Graphs for Our Consumption Items
