Following Don Ihde's idea of instrumental hermeneutics, our tools and methods function as the instruments through which r/confession gets filtered before it reaches our eyes. In that filtering, the methods transform the data in ways that can add to, remove, or alter its original meanings. Each computational step introduces its own biases:
Preprocessing (removing stop words and [removed]/[deleted] posts, along with lemmatization and n-gram processing) reduces noise so that computational methods can analyze the data cleanly. On the other hand, stop-word removal and lemmatization also flatten syntax and discard semantic nuance.
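A minimal sketch of this preprocessing step, using a toy stopword list and made-up post bodies (the project's actual stopword list and lemmatizer are not shown here):

```python
# Hypothetical preprocessing sketch: drop placeholder posts, lowercase,
# and strip stop words. The stopword set and posts are toy examples.
STOPWORDS = {"i", "the", "a", "to", "and", "of", "my", "it"}

def preprocess(posts):
    """Remove [removed]/[deleted] posts and stop words; return token lists."""
    cleaned = []
    for text in posts:
        if text.strip() in ("[removed]", "[deleted]"):
            continue  # placeholder bodies carry no analyzable content
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        cleaned.append(tokens)
    return cleaned

posts = ["I lied to my brother", "[removed]", "The guilt of it"]
print(preprocess(posts))  # stop words and placeholder posts are gone
```

Note how "I lied to my brother" collapses to two content words: the noise is gone, but so is the first-person framing that marks the confessional register.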
Temporal partitioning against a single fixed time zone makes it easy to draw conclusions about temporal structures in confessional genres, but it treats every post as if written in that zone, potentially misrepresenting minority voices from other parts of the world.
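The fixed-zone assumption can be made concrete with a small sketch. The bucket boundaries below are illustrative, not the project's actual partitions:

```python
from datetime import datetime, timezone

def partition_hour(created_utc):
    """Bucket a Unix timestamp into a time-of-day partition, interpreting
    every post in a single fixed zone (UTC here). Boundaries are toy values."""
    hour = datetime.fromtimestamp(created_utc, tz=timezone.utc).hour
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 21:
        return "day/evening"
    return "late night"

# The same timestamp reads as "late night" in UTC but may be midday
# local time for the poster; the fixed-zone assumption erases that.
print(partition_hour(1609459200))  # 2021-01-01 00:00:00 UTC -> "late night"
```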
TF-IDF over a global vocabulary foregrounds cross-partition contrasts (or, in our case, cross-partition conformity) but can suppress localized slang and non-English posts.
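A pure-Python sketch of the idea, treating each temporal partition as one "document" (toy data; the project presumably uses a library implementation):

```python
import math
from collections import Counter

def tfidf(partitions):
    """TF-IDF with one document per partition over a global vocabulary.
    Terms shared by every partition get idf = log(1) = 0, so only
    partition-distinctive vocabulary scores above zero."""
    n = len(partitions)
    df = Counter()
    for tokens in partitions:
        df.update(set(tokens))  # document frequency per term
    scores = []
    for tokens in partitions:
        tf = Counter(tokens)
        scores.append({w: (c / len(tokens)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

parts = [["guilt", "family", "guilt"], ["guilt", "work"]]
print(tfidf(parts))  # "guilt" appears everywhere, so it scores 0.0
```

The same mechanism that surfaces contrast also explains the suppression: a slang term concentrated in one partition but below any frequency cutoff applied upstream never enters the vocabulary at all.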
Word2Vec (skip-gram, window=5, min_count=10) emphasizes local sentence context, a window of five words on either side of each token, while filtering out tokens that appear fewer than ten times. This eases analysis of broad themes at the expense of niche language.
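The model itself would live in a library such as gensim; the pure-Python sketch below (a hypothetical helper, not the project's code) shows only what skip-gram training actually sees after the window and min_count filters are applied:

```python
from collections import Counter

def skipgram_pairs(sentences, window=5, min_count=10):
    """Generate the (center, context) pairs skip-gram trains on:
    tokens rarer than min_count are dropped first, then each surviving
    token is paired with neighbors within +/- window positions."""
    counts = Counter(t for s in sentences for t in s)
    kept = [[t for t in s if counts[t] >= min_count] for s in sentences]
    pairs = []
    for sent in kept:
        for i, center in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            pairs.extend((center, sent[j]) for j in range(lo, hi) if j != i)
    return pairs

# With min_count=10, a word used nine times across the whole corpus
# produces no training pairs at all - this is how niche language vanishes.
```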
Target sets demonstrate the clearest form of bias: the output is extremely sensitive to the composition of the target set, and just a few minor edits can produce a completely different list of "most biased" words.
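This sensitivity can be illustrated with made-up two-dimensional "embeddings" and a simple association score (mean cosine similarity to set A minus mean to set B); none of these vectors or words come from the actual analysis:

```python
import math

def cos(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def bias(word_vec, set_a, set_b, vecs):
    """Association score: mean similarity to set_a minus mean to set_b."""
    sa = sum(cos(word_vec, vecs[w]) for w in set_a) / len(set_a)
    sb = sum(cos(word_vec, vecs[w]) for w in set_b) / len(set_b)
    return sa - sb

# Toy embeddings, invented for illustration only.
vecs = {"guilt": (1.0, 0.2), "shame": (0.9, 0.4),
        "regret": (0.8, 0.1), "sleep": (0.0, 1.0)}

# Swapping a single anchor word in the target set shifts the score,
# and across a whole vocabulary such shifts reorder the "most biased" list.
a1 = bias(vecs["guilt"], ["shame"], ["sleep"], vecs)
a2 = bias(vecs["guilt"], ["regret"], ["sleep"], vecs)
print(a1, a2)
```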
t-SNE produces a visualization that places related words in clusters, revealing the topics prevalent in each temporal partition. However, it suppresses global relations: distances between far-apart clusters are not meaningful.
In short, the computational pipeline does not merely analyze the data; it co-creates meaning in the act of processing.