Week 8 - Tokenization and Clean-Up
Weekly Learning Goals
Understand the concepts of Tokenization, Lemmatization and Stemming
Model Text as Matrices
Explore the natural language processing package NLTK in Python
Key concepts: stopwords, tokenization, lemmatization, stemming, Zipf's law.
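To make these concepts concrete before opening the notebooks, here is a minimal sketch in plain Python (standard library only). It is not the course notebook and does not use the course corpus: the three sentences, the stopword list, and the helper names (`tokenize`, `naive_stem`, `clean`) are all made up for illustration. The suffix stripper is a crude stand-in for a real stemmer; NLTK provides proper tools for each step (for example `word_tokenize`, the `stopwords` corpus, `PorterStemmer`, and `WordNetLemmatizer`).

```python
import re
from collections import Counter

# Hypothetical mini-corpus for illustration only (not the course corpus).
docs = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "Dogs and cats are running in the garden.",
]

# Tiny hand-picked stopword list (an assumption); NLTK ships a much fuller one.
STOPWORDS = {"the", "on", "and", "are", "in", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def clean(text):
    """Tokenize, drop stopwords, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


cleaned = [clean(d) for d in docs]

# Term frequencies over the whole corpus, most frequent first.
# Zipf's law says that in a large corpus, frequency falls off roughly as
# 1 / rank; three sentences are far too few to show it, but the same
# rank/frequency table on a real corpus would.
freq = Counter(tok for doc in cleaned for tok in doc)
for rank, (term, count) in enumerate(freq.most_common(), start=1):
    print(rank, term, count)

# Text as a matrix: one row per document, one column per vocabulary term,
# each cell holding how often that term occurs in that document.
vocab = sorted(freq)
matrix = [[doc.count(term) for term in vocab] for doc in cleaned]

print(vocab)
for row in matrix:
    print(row)
```

Note that stems such as "chas" and "runn" are not real words. That is expected of stemming, which only chops suffixes; lemmatization instead maps each token to a dictionary form, and it is one reason to compare the two in the NLTK notebook.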
Applying Tokenization and Cleaning up Text using NLTK
Resources:
This FA is based on this corpus. Please download this file and upload it to JupyterHub.
Note (10/18): I moved this item to after Wednesday's class, as we did not have time to finish the Cleaning Notebook on Monday.
Below is another notebook, written by our colleagues in Canada from the former Tapor project. For more information, please visit their GitHub page at http://tapor.ca/home