Because the data was web scraped and follows a conversational pattern, it had many inherent issues that the text preprocessing steps needed to address. Most episodes followed a similar structure, but some structural dissimilarities made the cleaning process difficult and lengthy.
Text Cleaning
Basic string methods were used to (see the sketches after this list):
Extract information such as the episode category, guest name, and episode date
Remove timestamps (e.g., 00:00, 00:00:00)
Remove newline, tab, and other special characters
Fix a few structural errors (typos, contractions)
Collapse the repetitive use of words (e.g., "yes, yes, yes" or "wow, wow")
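As an illustration of the metadata extraction, here is a minimal sketch. It assumes each transcript begins with a header line such as "Data Science | Jane Doe | 2021-05-14"; the actual scraped layout is not shown above, so the HEADER_RE pattern and the extract_metadata helper are hypothetical.

```python
import re

# Hypothetical header layout: "Category | Guest Name | Date"
# (the actual scraped format may differ)
HEADER_RE = re.compile(r"^(?P<category>[^|]+)\|(?P<guest>[^|]+)\|(?P<date>.+)$")

def extract_metadata(transcript: str) -> dict:
    """Pull episode category, guest name, and date from the first line."""
    lines = transcript.splitlines()
    if not lines:
        return {}
    match = HEADER_RE.match(lines[0])
    if match is None:
        return {}
    return {key: value.strip() for key, value in match.groupdict().items()}
```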
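The remaining cleaning steps (timestamps, newline/tab characters, contractions, repeated words) can be sketched with standard-library regular expressions. The clean_text helper and the CONTRACTIONS table are illustrative names, and the contraction list shown is only a small subset.

```python
import re

# Illustrative subset; a real mapping would cover far more contractions
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text: str) -> str:
    """Apply the basic cleaning steps described in the list above."""
    # Remove timestamps such as 00:00 or 00:00:00
    text = re.sub(r"\b\d{1,2}:\d{2}(?::\d{2})?\b", "", text)
    # Replace newlines, tabs, and carriage returns with spaces
    text = re.sub(r"[\n\t\r]+", " ", text)
    # Expand contractions
    for short, full in CONTRACTIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    # Collapse immediate word repetitions ("yes, yes, yes" -> "yes")
    text = re.sub(r"\b(\w+)(?:[,\s]+\1\b)+", r"\1", text, flags=re.IGNORECASE)
    # Squeeze leftover runs of whitespace
    return re.sub(r"\s{2,}", " ", text).strip()
```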
Data Splitting
Host and guest dialogues were extracted for each episode (see the sketch after this list)
The data science and non-data science episodes were separated for analysis
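A minimal sketch of the dialogue split, assuming each turn in the transcript is prefixed with "Host:" or "Guest:"; those prefixes and the split_dialogues and split_by_category helpers are assumptions, since the real transcripts likely label speakers by name.

```python
def split_dialogues(transcript: str) -> tuple[list[str], list[str]]:
    """Separate host and guest turns using assumed speaker prefixes."""
    host_lines, guest_lines = [], []
    for line in transcript.splitlines():
        if line.startswith("Host:"):
            host_lines.append(line[len("Host:"):].strip())
        elif line.startswith("Guest:"):
            guest_lines.append(line[len("Guest:"):].strip())
    return host_lines, guest_lines

def split_by_category(episodes: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate episode groups, assuming each episode dict carries the
    "category" field produced by the metadata extraction above."""
    ds = [ep for ep in episodes if ep.get("category") == "Data Science"]
    non_ds = [ep for ep in episodes if ep.get("category") != "Data Science"]
    return ds, non_ds
```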
Others
For each analysis step, the data was preprocessed separately. Some common steps, sketched after this list, were:
Lowercasing the text
Removing digits and punctuation from the text, depending on the analysis
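These common steps can be sketched with the standard library alone; the normalize helper name and the strip_digits_punct flag are illustrative, the flag reflecting the per-analysis decision above.

```python
import string

def normalize(text: str, strip_digits_punct: bool = True) -> str:
    """Lowercase the text; optionally drop digits and punctuation."""
    text = text.lower()
    if strip_digits_punct:
        # A translation table that deletes every digit and punctuation mark
        table = str.maketrans("", "", string.digits + string.punctuation)
        text = text.translate(table)
    return text
```

Using str.translate with a deletion table processes a long transcript in a single pass, which is noticeably faster than character-by-character loops.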