To study oppositions (day/night and weekday/weekend), we used the Word2Vec model from gensim with the following parameters:
n_features = 300
min_word_count = 10
n_workers = multiprocessing.cpu_count()
window = 5
downsampling = 1e-2
seed = 1
sg = 1
epochs = 20
We then saved the output to data/embeddings/embeddings.emb to avoid unnecessary retraining.
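The training step above might look roughly like the following. This is a minimal sketch, assuming gensim 4.x keyword names (vector_size, min_count, sample, epochs), which we map onto the parameter names listed above. The sentences argument (a list of tokenized posts) and the train_and_save helper are hypothetical and not taken from the original code.

```python
import multiprocessing

# Parameters from the list above, mapped to gensim 4.x keyword names (assumed mapping).
W2V_PARAMS = {
    "vector_size": 300,                      # n_features
    "min_count": 10,                         # min_word_count
    "workers": multiprocessing.cpu_count(),  # n_workers
    "window": 5,
    "sample": 1e-2,                          # downsampling
    "seed": 1,
    "sg": 1,                                 # 1 selects skip-gram
    "epochs": 20,
}

EMBEDDINGS_PATH = "data/embeddings/embeddings.emb"


def train_and_save(sentences, path=EMBEDDINGS_PATH):
    """Train Word2Vec on tokenized sentences and save it (hypothetical helper)."""
    from gensim.models import Word2Vec  # imported lazily; requires gensim

    model = Word2Vec(sentences=sentences, **W2V_PARAMS)
    model.save(path)
    return model
```

A saved model can later be restored with Word2Vec.load(path) instead of retraining.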
Next, we fit the t-distributed stochastic neighbor embedding (t-SNE) model from sklearn.manifold, with the parameters init='pca' and learning_rate='auto', to project the 300-dimensional word vectors into two dimensions for plotting. We then pickled the resulting coordinates as a raw array to data/tnse_model and as a dataframe to data/tsne_df.pkl.
We then defined the following target sets for each opposition:
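The projection step could be sketched as follows. This assumes a gensim 4.x model (where the vocabulary and vectors are exposed as model.wv.index_to_key and model.wv.vectors) and a scikit-learn version that accepts learning_rate='auto'. The helper names and the dataframe column names (word, x, y) are illustrative assumptions, not the original code.

```python
import pickle

# t-SNE settings named in the text; n_components=2 gives the 2D projection.
TSNE_PARAMS = {"n_components": 2, "init": "pca", "learning_rate": "auto"}


def project_to_2d(model):
    """Return (words, coords) for a trained gensim Word2Vec model."""
    from sklearn.manifold import TSNE  # imported lazily; requires scikit-learn

    words = list(model.wv.index_to_key)  # vocabulary in index order
    vectors = model.wv.vectors           # array of shape (n_words, 300)
    coords = TSNE(**TSNE_PARAMS).fit_transform(vectors)
    return words, coords


def save_projection(words, coords, array_path, df_path):
    """Pickle the raw array and a dataframe version (hypothetical layout)."""
    import pandas as pd  # imported lazily; requires pandas

    with open(array_path, "wb") as fh:
        pickle.dump(coords, fh)
    df = pd.DataFrame({"word": words, "x": coords[:, 0], "y": coords[:, 1]})
    df.to_pickle(df_path)
    return df
```

Keeping the words alongside the coordinates in the dataframe makes it straightforward to look up the 2D position of any term when plotting.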
Day vs. Night
night_target = ["night", "midnight", "evening", "late", "dusk", "nocturnal", "afterhours", "overnight", "moonlight", "darkness"]
day_target = ["day", "morning", "noon", "afternoon", "early", "dawn", "daylight", "sunrise", "bright"]
Weekday vs. Weekend
weekend_target = ["weekend", "saturday", "sunday", "friday_night", "end_of_week", "rest_day"]
weekday_target = ["weekday", "monday", "tuesday", "wednesday", "thursday", "friday", "work_day"]
We then used the calculate_biased_words helper function from our custom utils library to find the words most strongly biased towards each target set, calling it as follows:
[day, night] = calculate_biased_words(model, night_target, day_target, 4)
[weekday, weekend] = calculate_biased_words(model, weekend_target, weekday_target, 4)
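The implementation of calculate_biased_words is not shown here, and the meaning of its fourth argument is not stated. Purely as an illustration of one common way to score such bias, the sketch below ranks each word by the difference between its mean cosine similarity to one target set and its mean cosine similarity to the other. The function name sketch_biased_words, the top_n parameter, the return order, and the toy vectors are all assumptions and should not be read as the behavior of our helper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_similarity(vec, targets, vectors):
    """Mean cosine similarity of vec to the target words present in vectors."""
    sims = [cosine(vec, vectors[t]) for t in targets if t in vectors]
    return sum(sims) / len(sims) if sims else 0.0


def sketch_biased_words(vectors, target_a, target_b, top_n):
    """Return (words closest to target_a, words closest to target_b).

    Hypothetical stand-in: a positive score leans towards target_a,
    a negative score leans towards target_b.
    """
    excluded = set(target_a) | set(target_b)
    scores = {}
    for word, vec in vectors.items():
        if word in excluded:
            continue  # do not rank the target words themselves
        scores[word] = (mean_similarity(vec, target_a, vectors)
                        - mean_similarity(vec, target_b, vectors))
    ranked = sorted(scores, key=scores.get)  # most target_b-leaning first
    toward_b = ranked[:top_n]
    toward_a = ranked[::-1][:top_n]
    return toward_a, toward_b


# Toy 2D vectors, invented for illustration only.
toy_vectors = {
    "night": (1.0, 0.0),
    "day": (0.0, 1.0),
    "intimacy": (0.9, 0.1),
    "emails": (0.1, 0.9),
}
toward_night, toward_day = sketch_biased_words(toy_vectors, ["night"], ["day"], 1)
```

With a trained gensim model, the vectors mapping could be built from model.wv, for example {w: model.wv[w] for w in model.wv.index_to_key}.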
Finally, we used bokeh to generate interactive .html plots of these biased sets, which you can explore below.
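A plot of this kind could be produced along the following lines. This is a sketch under stated assumptions: the helper names, the coords_by_word mapping (word to 2D t-SNE coordinate), the hover tooltip, and the output path are hypothetical, and only the red/blue color convention comes from the captions.

```python
def build_plot_data(coords_by_word, first_set, second_set):
    """Collect x/y/word/color columns: red for first_set, blue for second_set."""
    data = {"x": [], "y": [], "word": [], "color": []}
    for words, color in ((first_set, "red"), (second_set, "blue")):
        for word in words:
            if word not in coords_by_word:
                continue  # skip words that have no 2D coordinate
            x, y = coords_by_word[word]
            data["x"].append(x)
            data["y"].append(y)
            data["word"].append(word)
            data["color"].append(color)
    return data


def save_interactive_plot(data, title, path):
    """Write an interactive scatter plot to an .html file (requires bokeh)."""
    from bokeh.models import ColumnDataSource, HoverTool
    from bokeh.plotting import figure, output_file, save

    source = ColumnDataSource(data)
    plot = figure(title=title, tools="pan,wheel_zoom,reset")
    plot.scatter("x", "y", source=source, color="color", size=8)
    plot.add_tools(HoverTool(tooltips=[("word", "@word")]))  # show term on hover
    output_file(path)
    save(plot)


# Toy coordinates, invented for illustration only.
demo = build_plot_data(
    {"emails": (1.0, 2.0), "intimacy": (-1.0, 0.5)},
    ["emails", "not_in_vocab"],
    ["intimacy"],
)
```

Calling save_interactive_plot(demo, "Day vs. Night", "day_night.html") would then write a standalone page that can be opened in a browser.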
red terms biased towards day, blue biased towards night
There is a clear binary here between day terms and night terms. Day terms cluster primarily around school- and work-related words like "junior_high", "teaching", "secretary", and "emails", with a smaller concentration around meal-related words like "coffee", "cafeteria", and "lunch_break". Night terms, by contrast, cluster around words like "violated", "betrayed", "tension", and "intimacy". This points to an opposition between the structured, interpersonal, public "day" life and the intimate, personal, private "night" life. It suggests that daytime confessions tend to be about rule-bending within external institutional structures, while nighttime confessions tend to be more introspective, skewing towards personal, private vulnerability with greater emotional weight.
red terms biased towards weekday, blue biased towards weekend
Again, there is a clear divide between weekday terms and weekend terms. One could say that the differences are almost exactly like night and day (pun intended). Weekday terms again cluster around school and work, with words like "signed", "documents", "exam", and "attendance". Interestingly, there is also a cluster around health and serious illness, with words like "diagnosed", "symptoms", and "injury". This is likely an artifact of healthcare typically being available only on weekdays. Weekend terms, on the other hand, cluster around leisure and social life, with terms like "party", "video_games", "boyfriends/girlfriends", and "beers". Weekday language appears to index obligation and accountability, so weekday confessions may gravitate towards rule-breaking within institutional structures (much as day language does above). Weekend language, meanwhile, suggests a skew towards hedonism, where transgressions are viewed more lightly. The visualization thus suggests a possible binary of duty vs. desire.