Day 8

Today

- Debrief on warmup project data exploration notebooks
- Working with text data, text classification

For Next Time

- Complete the model iteration deliverable for the project

Debrief on Warmup Project Data Exploration Notebooks

I'll share some suggestions for improving your data exploration, based on what I saw in the first set of notebooks.

Building Your Own TF-IDF Vectorizer

We'll start by discussing the pre-class exercise. Next, we'll build on that exercise to classify posts from the newsgroup dataset. There's a notebook in the repository under in_class/day08 that you can use as a starting point. You are welcome to use your own TF-IDF implementation if you'd like.
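As a refresher before the classification work, here is a minimal sketch of a TF-IDF vectorizer in plain Python. It is not the pre-class solution: the whitespace tokenizer, the smoothed IDF formula, the L2 normalization, and the three toy documents are all assumptions chosen for illustration, and your own implementation may reasonably differ on each of them.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; real text needs more care.
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # Count each term at most once per document.
        doc_freq.update(set(tokenize(doc)))
    # Smoothed variant: log((1 + N) / (1 + df)) + 1. Other variants exist.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Return a sparse {term: weight} vector, scaled to unit L2 length."""
    counts = Counter(tokenize(doc))
    # Terms never seen while fitting the IDF table are ignored.
    weights = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items()}


# Toy documents, made up for this example.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]
idf = fit_idf(docs)
vectors = [tfidf_vector(d, idf) for d in docs]
```

Terms that appear in many documents (such as "the") get a lower IDF than terms that appear in only one (such as "mat"), which is the property that makes the weights useful as classifier features.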

More Advanced Text Processing

There is a great end-to-end IPython notebook that walks you through spam detection using scikit-learn and a natural language processing library called TextBlob. You can install TextBlob using pip:

$ pip install -U textblob

I recommend going through this notebook. You'll pick up useful tricks for NLP in Python and for scikit-learn (specifically, you'll see an example of a pipeline).
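If you haven't seen a scikit-learn pipeline before, the sketch below shows the general idea: a vectorizer and a classifier are chained so that a single fit or predict call runs both steps. This is not taken from the notebook. The choice of TfidfVectorizer and MultinomialNB, the step names, and the four toy messages are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data, made up for this example.
texts = [
    "win a free prize now",
    "claim your free cash prize",
    "are we still meeting for lunch",
    "see you in class tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

# Each step is a (name, estimator) pair; the names are arbitrary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])

# fit() learns the vocabulary and IDF weights, then trains the classifier.
pipeline.fit(texts, labels)

# predict() applies the same vectorization before classifying.
predictions = pipeline.predict(["free prize waiting for you"])
```

One advantage of bundling the steps is that the vectorizer is only ever fit on the data passed to fit, which helps avoid leaking test data into the features during cross-validation.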

Word2Vec

Google's word2vec has been making a lot of news lately. Here are some explanations of what it is (Google's explanation, and a good conceptual overview that gets technical in places), and here is a live demo that you can play with.
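To give a feel for what word vectors let you do, here is a toy sketch of the vector-arithmetic analogy that word2vec is known for. The three-dimensional vectors below are hand-made for illustration and are not real word2vec output (learned embeddings typically have hundreds of dimensions); the function names are also hypothetical.

```python
import math

# Hand-made toy vectors; dimensions loosely mean (royalty, male, female).
toy_vectors = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c, vectors):
    """Answer 'a is to b as c is to ?' via vec(b) - vec(a) + vec(c)."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])
    ]
    # Exclude the query words themselves from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("man", "king", "woman", toy_vectors)
```

With these toy vectors, result is "queen". Real embeddings are learned from large text corpora rather than written by hand, and the analogies they support are only approximate.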

If you want to go even deeper, here is a detailed tutorial on Kaggle about how to use word2vec for sentiment analysis of movie reviews (sound familiar?).
