Image credits: Tools by priyanka, Snake by Chaowalit Koetchuea, Happy Panda by Royyan Wijaya from the Noun Project
What do you do if you love Python and all things data science, but you hate inefficiency or repetitive tasks? Luckily, there are tools that can help with some of the more mundane aspects of data-science work. I'd like to let you know about a few that I've come across here in my early days in data science. Let's have a look at: (1) pandas profiling, (2) openrefine and (3) bamboolib.
What is it? How many times in your life have you typed df.head()? How about df['column'].unique()? I'm only a month into my data-science training and I've pounded out df.isnull().sum() on the keyboard countless times. Muscle memory is important, but when there are 5-10 commands you're entering into your console repeatedly, sometimes you just want to focus your energy on analyzing the output rather than obtaining it. pandas profiling comes to your rescue! This package will run all of these commands and more on your dataset with a single command, and output a lovely, easy-to-read HTML report for your review, highlighting key issues like missing values.
How to install it? pip install pandas_profiling
Any considerations? pandas_profiling may conflict with other tools in your toolkit. For instance, one day my scikit-learn code (LinearRegression and StandardScaler) seemed to randomly stop working. If you experience something similar, check to make sure that pandas_profiling is not the culprit.
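To make the "single command" idea concrete, here is a minimal sketch that assumes pandas and pandas_profiling are installed (newer releases of the package are published as ydata-profiling). The toy dataframe and the helper names manual_checks and write_profile are made up for illustration; only ProfileReport and its to_file method come from the package itself.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame(
    {
        "city": ["Vancouver", "vancouver", "Vancoover", None],
        "visits": [3, 5, None, 2],
    }
)


def manual_checks(frame: pd.DataFrame) -> dict:
    """Run the usual first-look commands by hand and collect the results."""
    return {
        "head": frame.head(),
        "unique_cities": frame["city"].unique(),
        "missing_per_column": frame.isnull().sum(),
    }


def write_profile(frame: pd.DataFrame, path: str = "report.html") -> None:
    """Build the HTML report in one call (needs pandas_profiling installed)."""
    # Imported here so the manual checks above still run without the package.
    from pandas_profiling import ProfileReport

    ProfileReport(frame, title="First look").to_file(path)


checks = manual_checks(df)
print(checks["missing_per_column"])
```

Calling write_profile(df) should then write report.html to the working directory, which you can open in any browser.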
See it in action!
What is it? Do you feel the pain when you have a city column with thousands of entries for Vancouver, vancouver, Vancoover, and so on and so forth? Openrefine, with its bulk text transformation tools, can save the day for you. It has a steep learning curve, but its tools are just as potent: they cover cleaning messy data, transforming it from one format into another, and extending it with web services and external data. Of course, any data scientist worth her salt can do all of those things too. But a tool like openrefine could come in handy in a time crunch. Those sorts of misspellings I mentioned above? With its text clustering tool, openrefine can detect them automatically and let you correct them in bulk.
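If you're curious how text clustering can catch those variants, here is a rough sketch of the idea in plain Python. It is not openrefine's code: it combines a simple "fingerprint" normalization (trim, lowercase, drop punctuation, sort the unique words) with fuzzy matching from the standard library's difflib. The function names and the 0.8 similarity cutoff are my own assumptions.

```python
import string
from collections import defaultdict
from difflib import get_close_matches


def fingerprint(value: str) -> str:
    """Normalize a string: trim, lowercase, drop punctuation, sort unique words."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation)
    )
    return " ".join(sorted(set(cleaned.split())))


def cluster(values, cutoff=0.8):
    """Group values whose fingerprints are identical or close enough."""
    clusters = defaultdict(list)
    for value in values:
        key = fingerprint(value)
        # Reuse an existing cluster if its key is similar enough (assumed cutoff).
        match = get_close_matches(key, list(clusters), n=1, cutoff=cutoff)
        clusters[match[0] if match else key].append(value)
    return dict(clusters)


cities = ["Vancouver", "vancouver", "Vancoover", " Vancouver ", "Toronto"]
groups = cluster(cities)
print(groups)
```

Here all four Vancouver variants land in one group and Toronto stays on its own. In practice you would still review each group before merging, since a loose cutoff can lump together names that are genuinely different.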
Anything worth knowing? openrefine has a GUI, but under the hood its transformations are written in its own expression language, called GREL. Google used to own it, but now it's a free open source software project; woo hoo! There is also extensive free training for the tool available from RefinePro, a consultancy based in Toronto.
How can I install it? Visit their website to download the code, or if you use Anaconda, just type conda install openrefine
Can I see it in action? Yes! Watch the video below to learn about openrefine's powerful bulk text transformation tools.
What is it? Let's say you're not a veteran of the *NIX world, and you miss those good old windows. Engaging in data science doesn't mean you need to be GUI-less for the rest of your life! bamboolib offers a graphical interface to explore and transform pandas dataframes. You can quickly visualize statistics related to your dataset, and engage in most of the same feature engineering that you would do from the command line. Those of us who are still in our data-science infancy should still learn to do all of this by hand, but bamboolib could be a powerful aid to intermediate learners who know what they want to do but not exactly how - bamboolib provides Python code for its actions in real time, so you can study it later to reproduce the same results yourself. Or it could be enjoyed by veterans who want to see a bit of colour and flair without having to type in a dozen or more lines of matplotlib code. Its developer claims the tool will make your EDA ten times faster. Try it and see.
Anything I need to know? bamboolib is a for-pay endeavour, so you'd need to pay a monthly fee to keep using it on private data in an enterprise environment. That said, it can be used for free on Kaggle and Binder, so enjoy!
How can I install it? Get a free trial code from the bamboolib site, and then follow the installation instructions there. You'll need to enable it as a Jupyter Notebook extension from a terminal window, and then import it like any other module within your notebook.
Thanks for reading! I hope you may find some of these tools useful. See you next time!