All raw data must be preprocessed before it can be analyzed by our computational models. Our first steps (sketched in code after the list below) were to:
- Drop rows:
  - where 'selftext' is either [removed] or [deleted].
  - with fewer than 4 characters, to remove posts with no content.
  - with null values.
- Lemmatize the text using spacy.load('en_core_web_sm').
- Remove stop words using the stop word list in spacy.lang.en.stop_words.STOP_WORDS.
- Append the processed text to a new column named 'pp_text'.
- Save the result as a .csv file for easier access in the future.
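The snippet below is a minimal sketch of these steps, assuming the posts live in a pandas DataFrame with a 'selftext' column. The file names raw_posts.csv and preprocessed_posts.csv and the preprocess helper are illustrative, not our exact code.

```python
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

nlp = spacy.load('en_core_web_sm')

def preprocess(text):
    """Lemmatize, lowercase, and strip stop words and punctuation."""
    doc = nlp(text)
    return ' '.join(
        token.lemma_.lower()
        for token in doc
        if token.is_alpha and token.lemma_.lower() not in STOP_WORDS
    )

# Hypothetical file name: the raw scrape with a 'selftext' column.
df = pd.read_csv('raw_posts.csv')

# Drop rows: nulls, removed/deleted posts, posts under 4 characters.
df = df.dropna(subset=['selftext'])
df = df[~df['selftext'].isin(['[removed]', '[deleted]'])]
df = df[df['selftext'].str.len() >= 4]

# New column with the processed text, then save for later reuse.
df['pp_text'] = df['selftext'].apply(preprocess)
df.to_csv('preprocessed_posts.csv', index=False)
```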
In addition to these preprocessing steps, we also trained Gensim Phrases on the tokenized posts to detect bigrams (min_count=10, threshold=50) and then trigrams (min_count=10, threshold=25); a sketch follows below.
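A minimal sketch of the phrase detection, assuming tokenized_posts is one token list per post, built here by splitting 'pp_text' on whitespace; the variable names are illustrative.

```python
from gensim.models import Phrases

# Assumed input: one token list per post from the 'pp_text' column.
tokenized_posts = [text.split() for text in df['pp_text']]

# First pass: learn bigrams with the settings reported above.
bigram = Phrases(tokenized_posts, min_count=10, threshold=50)
bigrammed = [bigram[post] for post in tokenized_posts]

# Second pass over the bigrammed corpus to pick up trigrams.
trigram = Phrases(bigrammed, min_count=10, threshold=25)
trigrammed = [trigram[post] for post in bigrammed]
```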
For example, here is one post before and after preprocessing.

Before:
'Title basically says it all. This wasn’t a professional organized competition. There was some bmx event with a couple hundred people and they always have a chili cook off. I’m not much for cooking so I thought it would be funny to throw a bunch of Wendy’s chili in a crock pot and see if anyone noticed - they didn’t. \n\nI’ve been a vegetarian for roughly twelve years so this was a long time ago.'
After:
'title basically says professional organized competition bmx event couple people chili cook cooking thought funny throw bunch wendy chili crock pot noticed vegetarian roughly years long time ago'