Bridging the gap between Pandas and Scikit-learn

Post date: Dec 30, 2014 8:38:35 AM

Pandas is a great tool for data analysis: it represents data as a data frame and provides tons of awesome tools for working with it. It feels like using R in Python, except it's even better!

Scikit-learn is arguably the best machine learning package for Python, not only because it is very well structured and designed, but also because of its large spectrum of models and its great support community. The problem is that scikit-learn is built on NumPy arrays, so a pandas DataFrame, especially one with categorical (string) columns, cannot be fed to its models directly.

There are at least two ways to bridge the gap:

  1. Convert the DataFrame to a list of dictionaries and pass it through DictVectorizer, which binarizes the categorical variables in one-hot-encoding style and keeps the numeric variables as they are.
  2. Use a package called sklearn-pandas, which takes care of all the steps mentioned above in one shot.
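
The first approach can be sketched roughly as follows. This is a minimal illustration rather than the code from the demo notebook: the toy DataFrame and its column names (color, size, label) are made up for the example, and a decision tree is fitted on the result only to show that the output is something scikit-learn accepts.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: one categorical column, one numeric column, one label.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": [1.0, 2.5, 3.0, 0.5],
    "label": [0, 1, 1, 0],
})

# Convert the feature columns to a list of dictionaries, one per row.
records = df[["color", "size"]].to_dict(orient="records")

# DictVectorizer one-hot encodes string values and passes numbers through.
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(records)
y = df["label"].to_numpy()

# X is now a plain NumPy array that any scikit-learn estimator can use.
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

print(vec.feature_names_)
print(X)
```

Each distinct value of a string column becomes its own 0/1 column named like "color=red", while the numeric column is copied over unchanged.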

I implemented a short demo of both approaches in an IPython notebook here. The demo uses a decision tree classifier as an example.

Useful resources

  • I also copied the demo from the sklearn-pandas page here.
  • This discussion on Quora is very helpful.
  • This post explains the data preprocessing step in Python very well. It also describes an interesting way to handle missing data in one-hot-encoding style that can improve performance significantly.
  • pandas also offers one-hot encoding of its own; see get_dummies().
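
As a quick illustration of the last point, here is a small sketch of get_dummies() on a made-up DataFrame (the column names color and size are hypothetical, chosen only for this example). It stays entirely within pandas and converts to a NumPy array at the end.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": [1.0, 2.5, 3.0, 0.5],
})

# One-hot encode only the categorical column; numeric columns are kept as-is.
encoded = pd.get_dummies(df, columns=["color"])

# Convert to a float NumPy array so it can be passed to scikit-learn.
X = encoded.to_numpy(dtype=float)

print(list(encoded.columns))
print(X)
```

get_dummies() also accepts dummy_na=True, which adds an extra indicator column for missing values. One thing to watch with this route: the set of columns depends on the values present in the data, so training and test frames should be encoded in a way that yields the same columns.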