Our TF-IDF model uses scikit-learn's TfidfVectorizer with ngram_range=(1, 1), ngram_range=(1, 3), and ngram_range=(2, 3); that is, we focused our attention on unigrams, uni/bi/trigrams, and bi/trigrams, respectively. The reason for analyzing bi/trigrams on their own is that the majority of lemmas are unigrams and would otherwise, in effect, drown out the contributions of bi/trigrams. We also set max_df=0.85 to filter out likely stop words.
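For concreteness, the following is a minimal sketch of these three vectorizer configurations; the dictionary and variable names are ours, not taken from the original pipeline:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# The three n-gram configurations described above. max_df=0.85 drops
# terms that appear in more than 85% of documents, a cheap proxy for
# stop-word filtering.
ngram_settings = {
    "unigrams": (1, 1),
    "uni/bi/trigrams": (1, 3),
    "bi/trigrams": (2, 3),
}

vectorizers = {
    name: TfidfVectorizer(ngram_range=rng, max_df=0.85)
    for name, rng in ngram_settings.items()
}
```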
Next, we computed the mean TF-IDF per term using a global vectorizer and plotted the top 10 terms, as sketched below. The reason for using a global vectorizer rather than a per-slice vectorizer is that we wanted to maintain strict cross-slice comparability: every slice is scored against the same vocabulary and the same document frequencies.
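A sketch of this step, assuming the full corpus lives in a list all_texts and the temporal partitions in a dict slices mapping labels to document lists (both names are hypothetical); for brevity it prints the top terms rather than plotting them:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit once on the full corpus so every slice shares one vocabulary
# and one set of IDF weights (the "global vectorizer").
global_vec = TfidfVectorizer(ngram_range=(1, 3), max_df=0.85)
global_vec.fit(all_texts)
terms = np.array(global_vec.get_feature_names_out())

def top_terms(docs, k=10):
    """Mean TF-IDF per term over `docs`, scored in the global vocabulary."""
    X = global_vec.transform(docs)              # sparse (n_docs, n_terms)
    means = np.asarray(X.mean(axis=0)).ravel()  # column means per term
    top = means.argsort()[::-1][:k]
    return list(zip(terms[top], means[top]))

for label, docs in slices.items():
    print(label, top_terms(docs))
```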
Surprisingly, the results are fairly consistent across all partitions, even after restricting to bi/trigrams alone. When unigrams are included, words like "like", "know", "feel", "time", and "want" are consistently weighted as the most important words in each section. When restricting to just bi/trigrams, one word stands out consistently: "feel". The abundance of first-person, emotionally indicative words and phrases at the top of the TF-IDF charts, together with their consistency across disparate temporal windows, suggests that confession, even when done digitally, remains a deeply personal and emotional act.
It is important to note that the consistency of our results across disjoint partitions of the data may partly be an artifact of using a single global vectorizer for every partition rather than a separate vectorizer fitted on each one. Later on, we shall uncover structures hidden from a surface-level TF-IDF analysis through a combination of word-embedding analysis and close hermeneutic reading.