Corpus creation: assembling the collection of text files (documents) to be mined.
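A minimal sketch of corpus creation, assuming the documents are plain-text files in a hypothetical docs/ directory:

```python
from pathlib import Path

# Read every .txt file under a (hypothetical) docs/ directory into a
# list of document strings: the in-memory corpus used in later steps.
corpus = [p.read_text(encoding="utf-8")
          for p in sorted(Path("docs").glob("*.txt"))]
```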
Tokenization: breaking text down into smaller units called tokens, typically individual words.
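A simple illustrative tokenizer (a plain regular-expression split, not a full linguistic tokenizer):

```python
import re

def tokenize(text):
    # Lowercase, then take maximal runs of letters, digits and
    # apostrophes; punctuation and whitespace become separators.
    return re.findall(r"[a-z0-9']+", text.lower())

tokenize("Winning isn't everything.")  # ['winning', "isn't", 'everything']
```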
Document Term Matrix (DTM): a matrix describing the frequency of terms across a collection of documents; rows correspond to documents, columns to terms, and each cell holds how often that term occurs in that document.
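A hand-rolled sketch of building a DTM from tokenized documents; in practice libraries such as scikit-learn's CountVectorizer do the same job:

```python
from collections import Counter

def build_dtm(docs):
    # docs: a list of token lists. Rows = documents, columns = vocabulary
    # terms, each cell = how often that term occurs in that document.
    vocab = sorted({t for doc in docs for t in doc})
    col = {term: j for j, term in enumerate(vocab)}
    dtm = [[0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            dtm[i][col[term]] = count
    return vocab, dtm

vocab, dtm = build_dtm([["win", "big"], ["win", "win", "now"]])
# vocab == ['big', 'now', 'win'];  dtm == [[1, 0, 1], [0, 1, 2]]
```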
Stemming: the process of reducing words to their base form, e.g. win, winning and winner are all mapped to (and counted under) the stem win.
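One widely used stemmer is NLTK's PorterStemmer; the exact outputs depend on the algorithm chosen:

```python
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
[stemmer.stem(w) for w in ["win", "winning", "winner"]]
# ['win', 'win', 'winner']; Porter leaves "winner" unchanged, so whether
# winner maps to win depends on the stemmer or lemmatizer you pick.
```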
Stop Words: the most common words in a language, which recur constantly but add little value to text mining, e.g. I, our, they'll. Counts vary by toolkit; one commonly used English stop-word list contains 174 words.
Bad Words: offensive words filtered out of the corpus; the sketch below removes stop words and bad words together.
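A combined filtering sketch using NLTK's English stop-word list (requires a one-off nltk.download("stopwords")) and a placeholder bad-word set:

```python
from nltk.corpus import stopwords  # one-off setup: nltk.download("stopwords")

STOP = set(stopwords.words("english"))
BAD = {"badword1", "badword2"}  # placeholder; substitute a real profanity list

def clean(tokens):
    # Drop stop words and offensive words in a single pass.
    return [t for t in tokens if t not in STOP and t not in BAD]

clean(["i", "think", "we", "win"])  # ['think', 'win']
```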
N-gram generation: extracting contiguous runs of n tokens, i.e. unigrams, bigrams and trigrams.
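A minimal n-gram generator over a token list:

```python
def ngrams(tokens, n):
    # All contiguous windows of n tokens (n = 1, 2, 3 for uni/bi/trigrams).
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

ngrams(["to", "win", "the", "game"], 2)
# [('to', 'win'), ('win', 'the'), ('the', 'game')]
```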
Text Similarity: measures that quantify how alike two texts are; all four below are illustrated together in the sketch after this list.
Jaccard Similarity: the size of the intersection of two token sets divided by the size of their union.
Cosine Similarity: the cosine of the angle between the two texts' term-frequency vectors.
N-gram Overlap: the proportion of n-grams the two texts share.
Levenshtein distance: the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other.
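A self-contained sketch of all four measures. The texts are token lists, except for Levenshtein, which works on raw strings; the n-gram overlap shown here (Jaccard over bigram sets) is one common formulation among several:

```python
import math
from collections import Counter

def jaccard(a, b):
    # |A ∩ B| / |A ∪ B| over token sets (1.0 if both are empty).
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def cosine(a, b):
    # Cosine of the angle between the two term-frequency vectors.
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def ngram_overlap(a, b, n=2):
    # One common formulation: Jaccard similarity over the n-gram sets.
    grams = lambda t: {tuple(t[i:i + n]) for i in range(len(t) - n + 1)}
    return jaccard(grams(a), grams(b))

def levenshtein(s, t):
    # Minimum number of single-character insertions, deletions and
    # substitutions turning s into t (row-by-row dynamic programming).
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

a, b = "the big win".split(), "a big win today".split()
jaccard(a, b)                     # 0.4
cosine(a, b)                      # ≈ 0.577
ngram_overlap(a, b)               # 0.25
levenshtein("winner", "winning")  # 3
```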