Day 8

Today

For Next Time

Debrief on Warmup Project Data Exploration Notebooks

I'll go over some suggestions for improving the data exploration, based on what I saw in the first set of notebooks.

Building Your Own TF-IDF Vectorizer

We'll start by discussing the pre-class exercise. Next, we'll build on this exercise to classify posts from the newsgroup dataset. There's a notebook in the repository under in_class/day08 that you can use as a starting point. You are welcome to use your own TF-IDF implementation if you'd like.
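If you'd like a reference point while comparing implementations, here is a minimal sketch of a hand-rolled TF-IDF vectorizer. It is not the solution to the pre-class exercise. The function names (tokenize, fit_idf, tfidf_vector), the whitespace tokenizer, the toy documents, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 are all assumptions chosen for illustration; your own version may reasonably differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split is enough for a toy example.
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)))
    # Assumed smoothing variant: ln((1 + n) / (1 + df)) + 1
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokenize(doc))
    # Terms never seen when fitting the IDF are ignored.
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return weights
    return {t: w / norm for t, w in weights.items()}


# Hypothetical toy corpus, not the newsgroup data.
docs = ["the cat sat", "the dog sat", "the bird flew"]
idf = fit_idf(docs)
vec = tfidf_vector("the cat sat", idf)
print(sorted(vec.items(), key=lambda item: -item[1]))
```

In this toy corpus, "the" appears in every document and so receives the lowest IDF, while a term that appears in only one document, such as "cat", is weighted most heavily.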

More Advanced Text Processing

There is a great end-to-end IPython notebook that walks you through spam detection using scikit-learn and a natural language processing library called TextBlob. To get TextBlob, install it with pip:

$ pip install -U textblob

I recommend going through this notebook. You'll pick up useful tricks for NLP in Python as well as for scikit-learn (specifically, you'll see an example of a pipeline).
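To give a rough idea of what a scikit-learn pipeline looks like before you open the notebook, here is a small sketch that chains a TF-IDF vectorizer and a naive Bayes classifier. It is not taken from the notebook: the four toy messages, the step names "tfidf" and "clf", and the choice of MultinomialNB are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the notebook uses a real SMS spam corpus.
texts = [
    "win a free prize now",
    "free cash offer click now",
    "are we still meeting for lunch",
    "see you at the meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# A pipeline bundles preprocessing and the model into one estimator,
# so fit() and predict() run every step in order.
spam_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # raw text -> TF-IDF features
    ("clf", MultinomialNB()),      # TF-IDF features -> class label
])

spam_pipeline.fit(texts, labels)
predictions = spam_pipeline.predict(["free prize offer", "lunch meeting tomorrow"])
print(list(predictions))
```

The main benefit is that the vectorizer is fit only on the data passed to fit(), which makes it harder to accidentally leak test data into the preprocessing step during cross-validation.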

Word2Vec

Google's word2vec has been making a lot of news lately. Here are some explanations of what it is (Google's explanation, and a good conceptual overview that gets technical in places), and here is a live demo that you can play with.

If you want to go even deeper, here is a detailed tutorial on Kaggle about how to use word2vec for sentiment analysis of movie reviews (sound familiar?).