Because the data was web scraped and follows a conversational pattern, it had many inherent issues that the text preprocessing steps needed to address. Most episodes followed a similar structure, but some structural dissimilarities made the cleaning process difficult and lengthy.
Text Cleaning
Basic string methods were used to (see the sketches after this list):
Extract information such as the episode category, guest name, and episode date
Remove timestamps (e.g., 00:00, 00:00:00)
Remove newline, tab, and other special characters
Fix a few structural errors (typos, contractions)
Collapse the repetitive use of words (e.g., "yes, yes, yes" or "wow, wow")
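As an illustration of the metadata extraction, here is a minimal sketch. It assumes each transcript begins with a header line such as "Data Science | Jane Doe | 2021-05-14"; the actual scraped layout is not shown above, so the HEADER_RE pattern and the extract_metadata helper are hypothetical.

```python
import re

# Hypothetical header layout: "Category | Guest Name | Date"
# (the actual scraped format may differ)
HEADER_RE = re.compile(r"^(?P<category>[^|]+)\|(?P<guest>[^|]+)\|(?P<date>.+)$")

def extract_metadata(transcript: str) -> dict:
    """Pull episode category, guest name, and date from the first line."""
    lines = transcript.splitlines()
    if not lines:
        return {}
    match = HEADER_RE.match(lines[0])
    if match is None:
        return {}
    return {key: value.strip() for key, value in match.groupdict().items()}
```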
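The remaining cleaning steps (timestamps, newline/tab characters, contractions, repeated words) can be sketched with standard-library regular expressions. The clean_text helper and the CONTRACTIONS table are illustrative names, and the contraction list shown is only a small subset.

```python
import re

# Illustrative subset; a real mapping would cover far more contractions
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text: str) -> str:
    """Apply the basic cleaning steps described in the list above."""
    # Remove timestamps such as 00:00 or 00:00:00
    text = re.sub(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", "", text)
    # Replace newlines, tabs, and carriage returns with spaces
    text = re.sub(r"[\n\t\r]+", " ", text)
    # Expand contractions
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    # Collapse immediate word repetitions ("yes, yes, yes" -> "yes")
    text = re.sub(r"\b(\w+)(?:[,\s]+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Squeeze leftover runs of whitespace
    return re.sub(r"\s{2,}", " ", text).strip()
```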
Data Splitting
Host and guest dialogues were extracted for each episode (see the sketch after this list)
The data science and non-data science episodes were separated for analysis
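A minimal sketch of the dialogue split, assuming each turn in the transcript is prefixed with "Host:" or "Guest:"; those prefixes and the split_dialogues and split_by_category helpers are assumptions, since the real transcripts likely label speakers by name.

```python
def split_dialogues(transcript: str) -> tuple[list[str], list[str]]:
    """Separate host and guest turns using assumed speaker prefixes."""
    host_lines, guest_lines = [], []
    for line in transcript.splitlines():
        if line.startswith("Host:"):
            host_lines.append(line[len("Host:"):].strip())
        elif line.startswith("Guest:"):
            guest_lines.append(line[len("Guest:"):].strip())
    return host_lines, guest_lines

def split_by_category(episodes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate episode groups, assuming each episode dict carries the
    "category" field produced by the metadata extraction above."""
    ds = [ep for ep in episodes if ep.get("category") == "Data Science"]
    non_ds = [ep for ep in episodes if ep.get("category") != "Data Science"]
    return ds, non_ds
```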
Others
For each analysis step, the data was preprocessed separately. Some common steps, sketched after this list, were:
Lowercasing the text
Removing digits and punctuation from the text, depending on the analysis
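These common steps can be sketched with the standard library alone; the normalize helper name and the strip_digits_punct flag are illustrative, the flag reflecting the per-analysis decision above.

```python
import string

def normalize(text: str, strip_digits_punct: bool = True) -> str:
    """Lowercase the text; optionally drop digits and punctuation."""
    text = text.lower()
    if strip_digits_punct:
        # A translation table that deletes every digit and punctuation mark
        table = str.maketrans("", "", string.digits + string.punctuation)
        text = text.translate(table)
    return text
```

Using str.translate with a deletion table processes a long transcript in a single pass, which is noticeably faster than character-by-character loops.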