We began our exploration of topic modeling with data scientist Julia Silge’s blog post, which detailed how to use the Structural Topic Modeling (stm) package in R. She used the Sherlock Holmes texts as her dataset, and while her post provided key insights into the foundations of topic modeling, we ultimately pursued a different method of creating topic models because hers did not fit our particular dataset and goals. The key difference is that we had to create one topic model per text, whereas she created one topic model for several texts. This meant that the parameters she passed into her document-term matrix (DTM), the mathematical matrix that records how often each term occurs in each document of a collection, were different from ours. Realizing this and making the change was a significant challenge, since we had little background on DTMs and the different ways to construct them.
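To make the difference concrete, here is a simplified sketch of one way to build a DTM for a single text, assuming the tidytext workflow that Silge’s post is built on. When the whole corpus is one text, the DTM needs documents of its own, so this sketch splits the text into fixed-size chunks; the chunking approach and the 500-word chunk size here are just for illustration, not our exact settings.

```r
library(dplyr)
library(tidytext)

# A text arrives as a single character string; we split it into fixed-size
# chunks that serve as the "documents" of that text's own DTM.
build_dtm_for_text <- function(raw_text, chunk_size = 500) {
  tibble(text = raw_text) %>%
    unnest_tokens(word, text) %>%                          # one row per word
    anti_join(stop_words, by = "word") %>%                 # drop stop words
    mutate(chunk = (row_number() - 1) %/% chunk_size) %>%  # assign chunk ids
    count(chunk, word) %>%                                 # counts per chunk
    cast_dtm(document = chunk, term = word, value = n)     # tidytext -> tm DTM
}
```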
However, once we figured this out, we were able to successfully generate groups of related words for each text. We could also select how many words per topic and how many topics per text to generate; after some trial and error we settled on four words per topic and five topics per text, enough to cover each text’s themes without too much repetition.
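From there, fitting a model per text and pulling out its top words looks roughly like the sketch below; the `LDA()` call from the topicmodels package stands in for whichever fitting function a given pipeline uses.

```r
library(topicmodels)
library(tidytext)
library(dplyr)

# Fit a five-topic model on one text's DTM and keep the four strongest
# words per topic, yielding the twenty words used downstream.
top_words_for_text <- function(dtm, n_topics = 5, words_per_topic = 4) {
  lda <- LDA(dtm, k = n_topics, control = list(seed = 42))
  tidy(lda, matrix = "beta") %>%           # per-topic word probabilities
    group_by(topic) %>%
    slice_max(beta, n = words_per_topic, with_ties = FALSE) %>%
    ungroup() %>%
    pull(term)
}
```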
Here you can see how our code was able to pick out various topics from a single text.
Our main use of topic modeling was as a filter: only the relevant texts were passed along to a new Box folder and on to further stages of analysis, like creating the word embeddings. To determine relevance, we constructed lexicons against which the topic model output for each text (twenty words: five topics of four words each) was compared. We had two lexicons for this relevance determination: a philosophy lexicon and a religion lexicon.
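The relevance check itself is simple, as the sketch below shows. The lexicon entries here are placeholder samples, not our real lists, which were far longer (gold and silver appear in the religion sample deliberately, for reasons that will become clear later).

```r
# Sample lexicon entries only; our actual lists were much more extensive.
philosophy_lexicon <- c("virtue", "temperance", "justice", "polis")
religion_lexicon   <- c("christ", "salvation", "sin", "worship",
                        "gold", "silver")

# A text is relevant if any of its twenty topic words hits either lexicon.
is_relevant <- function(topic_words) {
  any(topic_words %in% philosophy_lexicon) ||
    any(topic_words %in% religion_lexicon)
}
```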
In order to find texts that we believed contained ethical topics related to consumption, we had to select the terms for our lexicons carefully. Most of our religious terms came from common Christian phrases and well-known parts of the Bible. Our readings came from Catholic and Protestant writers, which shaped our understanding of common Christian beliefs during the early modern period. Using this knowledge and the aid of our project managers, we included the books of the Bible, the seven deadly sins, synonyms for Jesus, synonyms for God, phrases of worship, satanic concepts, and many other biblical concepts. For philosophical terms we turned to the Greek philosopher Aristotle, since the educated of the early modern period also studied him. We included terms from Aristotle’s Politics and Nicomachean Ethics, tracking the concepts that recurred most often in those texts. Many Greek terms had to be left out because they lacked a modern English or Latin translation clear enough to survive our text-cleaning process. We were able to include some Latin terms, names of other philosophers, fields of study, and Greek cities. Even so, our philosophical lexicon was not nearly as extensive as our religious one.
Additionally, for each text we counted the number of philosophical words and the number of religious words in the topic modeling output and expressed them as a philosophy:religion ratio, which helped us determine the main theme of the text. A text was added to the Relevant Files Box folder only if its topic words matched at least one of these lexicons; all other texts were ignored.
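Continuing the sample lexicons from the sketch above, the counting and labeling step amounts to something like this (the theme labels and the handling of zero matches are our illustration):

```r
# Count lexicon hits among the twenty topic words and label the text.
score_text <- function(topic_words) {
  n_phil <- sum(topic_words %in% philosophy_lexicon)
  n_rel  <- sum(topic_words %in% religion_lexicon)
  theme <- if (n_phil + n_rel == 0) NA_character_  # no match: text is ignored
           else if (n_rel == 0)     "philosophy"
           else if (n_phil == 0)    "religion"
           else                     "both"
  list(philosophy = n_phil, religion = n_rel, theme = theme)
}
```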
We also decided to switch to analyzing consumption items, and to use these philosophy:religion ratios to further our understanding of how the sentiments surrounding those items change with the main theme of a text (philosophy, religion, or both). However, some unexpected results led us to discover mistakes in our process. To begin, no texts were flagged as philosophy-dominant; they were all either religion-dominant or religion-only. We believe this is partly because our religion lexicon was more extensive, and partly because the influence of the church in this period meant that religion shaped most of the texts being produced. More seriously, the words gold and silver, two of the five consumption words we were tracking, were part of our religion lexicon. We had originally included them because they appear frequently in the Bible, and at the time we built the lexicon we had not yet decided to focus on consumption items. Because we forgot to remove gold and silver from the religion lexicon, those words were counted as religious no matter what context they appeared in. Even when they were used philosophically, or simply in discussions of trade (which should have been filtered out during the first stage of topic modeling), our code placed every text containing gold or silver into the religious category and gave it a zero for the percentage of philosophical components. This skewed our results, and unfortunately we did not have the time to fix it. The code still produced interesting results, but this is an error that any further research should correct.
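For anyone picking this work up, the fix itself is small; continuing the sample lexicons above, it amounts to dropping the consumption items before classification:

```r
# Remove the consumption items from the religion lexicon so that gold and
# silver are no longer counted as religious regardless of context.
religion_lexicon <- setdiff(religion_lexicon, c("gold", "silver"))
```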