When I was a young activist, I read a critical feminist sexuality text called Pleasure and Danger, edited by American anthropologist Carole S. Vance. I was always struck by its effective title and its succinct way of encapsulating the issues at stake. Now, as a budding data scientist faced with some of the rigors of my trade, such as data cleaning, the phrase comes in handy.
Well, yes. I know that nearly 60 percent of data scientists surveyed described data cleaning as their most hated task, but I actually enjoy it. Really? Sure. As an introvert, I gain comfort from order and predictability, and as a published author and longtime journalist, I know the satisfaction of putting things in their proper order for maximum effectiveness. But I also know there is such a thing as too much of a good thing.
In my earliest moments in data science, I was drawn to data cleaning. What metaphor can I use for it? Well, it's sort of like taking a shower. Except it's not you, it's the data. Well, I mean, yes, I do shower before cleaning data. Daily, I swear. OK, scratch that. Cleaning data is like being a barber. Or a hairstylist, whatever you prefer. You take something that is in a raw and untamed state and make it pleasant, likeable and predictable, like the haircut of some person your parents would approve of, with whom you might feel comfortable going on a dinner date. That's all very well and good, but if you have ever had a bad haircut, you know how things can go wrong. Or, if cleaning data is like being a potter, what if you're handed a lump of clay that roughly looks like an obelisk, but by the time you are done with it, it looks more like an Easter basket? I'll stop there. You're welcome. I think you get the point!
This brief post has two purposes: first, to look at some means of cleaning data, and second, to discuss some of the risks and perils of doing so. The ideas in this post borrow from teachings by Matt Brems and Musfiqur Rahman at General Assembly. I also nicked a few great visualization ideas from Lianne & Justin @ Just into Data.
Several basic data-cleaning techniques are well known and perhaps even widespread. Delete observations with missing data. Delete features with missing data! A dataset is mostly complete, but a bunch of info is missing from features related to respondents' body weight or annual income? Well,