Day 8
Today
Debrief on warmup project data exploration notebooks
Working with text data, text classification
For Next Time
Complete the model iteration deliverable for the project
Debrief on Warmup Project Data Exploration Notebooks
I'll be discussing some suggestions for improving the data exploration that I saw in the first set of notebooks.
Building Your Own TF-IDF Vectorizer
We'll start by discussing the pre-class exercise. Next, we'll build on that exercise to classify posts from the newsgroup dataset. There's a notebook in the repository under in_class/day08 that you can use as a starting point. You are certainly welcome to use your own TF-IDF implementation if you'd like.
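As a point of reference for the exercise, here is a minimal sketch of a TF-IDF computation in plain Python. It is not the notebook's implementation: the function names (tokenize, fit_idf, tfidf), the whitespace tokenizer, the unsmoothed idf formula log(N / df), and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute idf(t) = log(N / df(t)) for every term seen in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def tfidf(doc, idf):
    """Return {term: tf * idf} for one document, with tf = count / length."""
    tokens = tokenize(doc)
    counts = Counter(tokens)
    # Terms that were not in the fitted corpus are skipped.
    return {t: (c / len(tokens)) * idf[t] for t, c in counts.items() if t in idf}


# Toy corpus, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
idf = fit_idf(docs)
weights = tfidf(docs[0], idf)

# Print terms from highest to lowest weight for the first document.
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:>4}  {weight:.3f}")
```

Note that "cat" outweighs "the" in the first document even though "the" occurs twice, because "the" appears in two of the three documents. If you compare against scikit-learn's TfidfVectorizer, expect different numbers: by default it uses a smoothed idf and L2-normalizes each document vector.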
More Advanced Text Processing
There is a really great end-to-end IPython notebook that will walk you through spam detection using scikit-learn and a natural language processing library called TextBlob. To get the TextBlob library, you can install it using pip:
$ pip install -U textblob
I recommend going through this notebook. You'll pick up some really useful tricks for both NLP in Python and scikit-learn (specifically, you'll see an example of a pipeline).
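If you haven't seen a scikit-learn pipeline before, here is a rough sketch of the idea: the vectorizer and the classifier are chained into one object, so fit and predict take raw text. This is not the code from the linked notebook. The four toy messages, their labels, the step names "tfidf" and "clf", and the choice of MultinomialNB as the classifier are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, made up for this example.
texts = [
    "win a free prize now",
    "free cash click now",
    "are we still meeting for lunch",
    "see you in class tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Each step is a (name, estimator) pair; the last step is the classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# fit() runs the vectorizer's fit_transform, then fits the classifier on the result.
pipeline.fit(texts, labels)

# predict() applies the same fitted vectorizer before classifying.
predictions = pipeline.predict(["claim your free prize"])
print(predictions)
```

The main payoff is that the whole pipeline can be handed to tools like cross_val_score or GridSearchCV, which refit the vectorizer on each training fold rather than on all of the data.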
Word2Vec
Google's word2vec has been making a lot of news lately. Here are some explanations of what it is (Google's explanation, and a good conceptual overview, though it gets technical), and here is a live demo that you can play with.
If you want to go even deeper, here is a detailed tutorial on Kaggle about how to use word2vec for sentiment analysis of movie reviews (sound familiar?).